From 8e68ad997ce6c34b7822fd1a34d470b4e766d17e Mon Sep 17 00:00:00 2001 From: TomShawn <41534398+TomShawn@users.noreply.github.com> Date: Mon, 22 Jun 2020 17:39:09 +0800 Subject: [PATCH 1/8] add tune-operating-system doc --- TOC.md | 2 + tune-operating-system.md | 120 +++++++++++++++++++++++++++++++++++++++ 2 files changed, 122 insertions(+) create mode 100644 tune-operating-system.md diff --git a/TOC.md b/TOC.md index c9e469081bd06..aeb8cbdd2c244 100644 --- a/TOC.md +++ b/TOC.md @@ -88,6 +88,8 @@ + [Troubleshoot TiCDC](/ticdc/troubleshoot-ticdc.md) + [Troubleshoot TiFlash](/tiflash/troubleshoot-tiflash.md) + Performance Tuning + + System Tuning + + [Operating System Tuning](/tune-operating-system.md) + Software Tuning + Configuration + [TiKV Tuning](/tune-tikv-performance.md) diff --git a/tune-operating-system.md b/tune-operating-system.md new file mode 100644 index 0000000000000..ab32f93e4ee94 --- /dev/null +++ b/tune-operating-system.md @@ -0,0 +1,120 @@ +--- +title: Operating System Tuning +summary: Learn how to tune the parameters of the operating system. +category: reference +--- + +# Operating System Tuning + +This document introduces how to tune each subsystem of CentOS 7. + +> **Note:** +> +> + The default configuration of the CentOS 7 operating system is suitable for most services running under moderate workload. Adjusting the performance of a particular subsystem might negatively affects other subsystems. Therefore, before tuning the system, back up all user data and configuration information; +> + Fully test all the changes in the test environment before applying to the production environment. + +## Performance analysis methods + +System tuning must be based on the results of system performance analysis, so this document first lists common methods for performance analysis. + +### In 60 seconds + +[*Linux Performance Analysis in 60,000 Milliseconds*](http://www.brendangregg.com/Articles/Netflix_Linux_Perf_Analysis_60s.pdf) is published by the author Brendan Gregg and the Netflix Performance Engineering team. All used tools can be obtained from the official release of Linux. You can analyze outputs of the following list items to troubleshoot most common performance issues. + ++ `uptime` ++ `dmesg | tail` ++ `vmstat 1` ++ `mpstat -P ALL 1` ++ `pidstat 1` ++ `iostat -xz 1` ++ `free -m` ++ `sar -n DEV 1` ++ `sar -n TCP,ETCP 1` ++ `top` + +For detailed usage, see the Linux manual pages. + +### perf + +perf is an important performance analysis tool provided by the Linux kernel, which covers hardware level (CPU/PMU, performance monitoring unit) features and software features (software counters, trace points). For detailed usage, see [perf Examples](http://www.brendangregg.com/perf.html#Background). + +### BCC/bpftrace + +Starting from CentOS 7.6, the Linux kernel has supported Berkeley Packet Filter (BPF). Therefore, you can choose proper tools to conduct an in-depth analysis based on the results in [In 60 seconds](#in-60-seconds). Compared with perf/ftrace, BPF provides programmability and smaller performance overhead. Compared with kprobe, BPF provides higher security and is more suitable for the production environments. For detailed usage of the BCC toolkit, see [BPF Compiler Collection (BCC)](https://github.com/iovisor/bcc/blob/master/README.md). + +## Performance tuning + +This section introduces performance tuning based on the classified kernel subsystems. + +### CPU -- frequency scaling + +cpufreq is a module that dynamically adjusts the CPU frequency and supports five modes. 
To ensure service performance, select the performance mode and fix the CPU frequency at the highest supported operating frequency without dynamic adjustment. The command for this operation is `cpupower frequency-set --governor performance`. + +### CPU -- interrupt affinity + +- Automatic balance can be implemented through the `irqbalance` service. +- Manual balance: + - Identify the devices that need to balance interrupts. Starting from CentOS 7.5, the system automatically configures the best interrupt affinity for certain devices and their drivers, such as devices that use the `be2iscsi` driver and NVMe settings. You can no longer manually configure interrupt affinity for such devices. + - For other devices, check the chip manual to see whether these devices support distributing interrupts. + - If they do not, all interrupts of these devices are routed to the same CPU and cannot be modified. + - If they do, calculate the `smp_affinity` mask and set the corresponding configuration file. For details, see [kernel document](https://www.kernel.org/doc/Documentation/IRQ-affinity.txt). + +### NUMA CPU binding + +To avoid accessing memory across Non-Uniform Memory Access (NUMA) nodes as much as possible, you can bind a thread/process to certain CPU cores by setting the CPU affinity of the thread. For ordinary programs, you can use the `numactl` command for the CPU binding. For detailed usage, see the Linux manual pages. For network interface card (NIC) interrupts, see [tune network](#tune-network). + +### Memory -- transparent huge page (THP) + +It is **NOT** recommended to use THP for database applications, because databases often have sparse rather than continuous memory access patterns. If high-level memory fragmentation is serious, a higher latency will occur when THP pages are allocated. If the direct compaction is enabled for THP, the CPU usage will surge. Therefore, it is recommended to disable the direct compaction for THP. + +``` sh +echo never > /sys/kernel/mm/transparent_hugepage/enabled +echo never > /sys/kernel/mm/transparent_hugepage/defrag +``` + +### Memory -- virtual memory parameters + +- `dirty_ratio` percentage ratio. When the total amount of dirty page caches reach this percentage ratio of the total system memory, the system starts to use the `pdflush` operation to write the dirty page caches to disk. The default value of `dirty_ratio` is 20% and usually does not need adjustment. For high-performance SSDs such as NVMe devices, lowering this value helps improve the efficiency of memory reclamation. +- `dirty_background_ratio` percentage ratio. When the total amount of dirty page caches reach this percentage ratio of the total system memory, the system starts to write the dirty page caches to disk in the background. The default value of `dirty_ratio` is 10% and usually does not need adjustment. For high-performance SSDs such as NVMe devices, lower value helps improve the efficiency of memory reclamation. + +### Storage and file system + +The core I/O stack link is long, including the file system layer, the block device layer, and the driver layer. + +#### I/O scheduler + +The I/O scheduler determines when and how long I/O operations run on the storage device. It is also called I/O elevator. For SSD devices, it is recommended to set the I/O scheduling policy to noop. + +```sh +echo noop > /sys/block/${SSD_DEV_NAME}/queue/scheduler +``` + +#### Formatting parameters -- block size + +Blocks are the working units of the file system. 
The block size determines how much data can be stored in a single block, and thus determines the minimum amount of data to be written or read each time. + +The default block size is suitable for most scenarios. However, if the block size (or the size of multiple blocks) is the same or slightly larger than the amount of data normally read or written each time, the file system performs better and the data storage efficiency is higher. Small files still uses the entire block. Files can be distributed among multiple blocks, but this will increase runtime overhead. + +When using the `mkfs` command to format a device, specify the block size as a part of the file system options. The parameters that specify the block size vary with the file system. For details, see the corresponding mkfs manual pages. + +#### `mount` parameters + +If the `noatime` option is enabled in the `mount` command, the update of metadata is disabled when files are read. If the `nodiratime` behavior is enabled, the update of metadata is disabled when the directory is read. + +### Network tuning + +The network subsystem consists of many different parts with sensitive connections. The CentOS 7 network subsystem is designed to provide the best performance for most workloads and automatically optimizes the performance of these workloads. Therefore, usually you do not need to manually adjust network performance. + +Network issues are usually caused by issues of hardware or related devices. So before tuning the protocol stack, rule out hardware issues. + +Although the network stack is largely self-optimizing, the following aspects in the network packet processing might become the bottleneck and reduce performance: + +- NIC hardware cache: To correctly observe the packet loss at the hardware level, use the `ethtool -S ${NIC_DEV_NAME}` command to observe the `drops` field. When packet loss occurs, it might be that the processing speed of the hard/soft interrupts cannot follow the receiving speed of NIC. If the received buffer size is less than the upper limit, you can also try to increase the RX buffer to avoid packet loss. The query command is: `ethtool -g ${NIC_DEV_NAME}`, and the modification command is `ethtool -G ${NIC_DEV_NAME}`. +- Hardware interrupts: If the NIC supports the Receive-Side Scaling (RSS, also called multi-NIC receiving) feature, observe the `/proc/interrupts` NIC interrupts. If the interrupts are uneven, see [CPU -- scale frequency](#cpu----frequency-scaling), [CPU -- interrupt affinity](#cpu----interrupt-affinity), and [NUMA CPU binding](#numa-cpu-binding). If the NIC does not support RSS or the number of RSS is much smaller than the number of physical CPU cores, you can configure Receive Packet Steering (RPS, which can be regarded as the software implementation of RSS), and the RPS extension Receive Flow Steering (RFS). For detailed configuration, see [kernel document](https://www.kernel.org/doc/Documentation/networking/scaling.txt) +- Soft interrupts: Observe the monitoring of `/proc/net/softnet\_stat`. If the values of the other columns except the third column are increasing, properly adjust the value of `net.core.netdev\_budget` or `net.core.dev\_weight` for `softirq` to get more CPU time. In addition, you also need to check the CPU usage to determine which tasks are frequently using the CPU and whether they can be optimized. +- Receive queue of application sockets: Monitor the `Resv-q` column of `ss -nmp`. 
If the queue is full, consider increasing the size of the application socket cache or use the automatic cache adjustment method. In addition, consider whether you can optimize the architecture of the application layer and reduce the interval between reading sockets. +- Ethernet flow control: If the NIC and switch support the flow control feature, you can use this feature to give kernel some time to process the data in the NIC queue and to avoid the issue of NIC buffer overflow. +- Interrupts coalescing: Too frequent hardware interrupts reduces system performance, and too late hardware interrupts causes packet loss. Newer NICs support the interrupt coalescing feature and allows the driver to automatically adjust the number of hardware interrupts. You can execute `ethtool -c ${NIC_DEV_NAME}` to check and `ethtool -C ${NIC_DEV_NAME}` to enable this feature. The adaptive mode allows the NIC to automatically adjust the interrupt coalescing. In this mode, the driver checks the traffic mode and kernel receiving mode, and evaluates the coalescing settings in real time to prevent packet loss. NICs of different brands have different features and default configurations. For details, see the NIC manuals. +- Adapter queue: Before processing the protocol stack, the kernel uses this queue to buffer the data received by the NIC, and each CPU has its own backlog queue. The maximum number of packets that can be cached in this queue is `netdev\_max\_backlog`. Observe the second column of `/proc/net/softnet\_stat`. When the second column of a row continues to increase, it means that the CPU [row-1] queue is full and the data packet is lost. To resolve this problem, continue to double the `net.core.netdev \_max\_backlog` value. +- Send queue: The length value of a send queue determines the number of packets that can be queued before sending. The default value is `1000`, which is sufficient for 10 Gbps. +- Driver: NIC drivers usually provide tuning parameters. See the device hardware manual and its driver documentation. From 5b127033c7952a6f6b3f352c2fd7e355a898eec3 Mon Sep 17 00:00:00 2001 From: TomShawn <41534398+TomShawn@users.noreply.github.com> Date: Mon, 22 Jun 2020 21:19:45 +0800 Subject: [PATCH 2/8] Apply suggestions from code review --- tune-operating-system.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/tune-operating-system.md b/tune-operating-system.md index ab32f93e4ee94..6045597c19d0e 100644 --- a/tune-operating-system.md +++ b/tune-operating-system.md @@ -111,10 +111,10 @@ Although the network stack is largely self-optimizing, the following aspects in - NIC hardware cache: To correctly observe the packet loss at the hardware level, use the `ethtool -S ${NIC_DEV_NAME}` command to observe the `drops` field. When packet loss occurs, it might be that the processing speed of the hard/soft interrupts cannot follow the receiving speed of NIC. If the received buffer size is less than the upper limit, you can also try to increase the RX buffer to avoid packet loss. The query command is: `ethtool -g ${NIC_DEV_NAME}`, and the modification command is `ethtool -G ${NIC_DEV_NAME}`. - Hardware interrupts: If the NIC supports the Receive-Side Scaling (RSS, also called multi-NIC receiving) feature, observe the `/proc/interrupts` NIC interrupts. If the interrupts are uneven, see [CPU -- scale frequency](#cpu----frequency-scaling), [CPU -- interrupt affinity](#cpu----interrupt-affinity), and [NUMA CPU binding](#numa-cpu-binding). 
If the NIC does not support RSS or the number of RSS is much smaller than the number of physical CPU cores, you can configure Receive Packet Steering (RPS, which can be regarded as the software implementation of RSS), and the RPS extension Receive Flow Steering (RFS). For detailed configuration, see [kernel document](https://www.kernel.org/doc/Documentation/networking/scaling.txt) -- Soft interrupts: Observe the monitoring of `/proc/net/softnet\_stat`. If the values of the other columns except the third column are increasing, properly adjust the value of `net.core.netdev\_budget` or `net.core.dev\_weight` for `softirq` to get more CPU time. In addition, you also need to check the CPU usage to determine which tasks are frequently using the CPU and whether they can be optimized. +- Software interrupts: Observe the monitoring of `/proc/net/softnet\_stat`. If the values of the other columns except the third column are increasing, properly adjust the value of `net.core.netdev\_budget` or `net.core.dev\_weight` for `softirq` to get more CPU time. In addition, you also need to check the CPU usage to determine which tasks are frequently using the CPU and whether they can be optimized. - Receive queue of application sockets: Monitor the `Resv-q` column of `ss -nmp`. If the queue is full, consider increasing the size of the application socket cache or use the automatic cache adjustment method. In addition, consider whether you can optimize the architecture of the application layer and reduce the interval between reading sockets. - Ethernet flow control: If the NIC and switch support the flow control feature, you can use this feature to give kernel some time to process the data in the NIC queue and to avoid the issue of NIC buffer overflow. -- Interrupts coalescing: Too frequent hardware interrupts reduces system performance, and too late hardware interrupts causes packet loss. Newer NICs support the interrupt coalescing feature and allows the driver to automatically adjust the number of hardware interrupts. You can execute `ethtool -c ${NIC_DEV_NAME}` to check and `ethtool -C ${NIC_DEV_NAME}` to enable this feature. The adaptive mode allows the NIC to automatically adjust the interrupt coalescing. In this mode, the driver checks the traffic mode and kernel receiving mode, and evaluates the coalescing settings in real time to prevent packet loss. NICs of different brands have different features and default configurations. For details, see the NIC manuals. +- Interrupts coalescing: Too frequent hardware interrupts reduces system performance, and too late hardware interrupts causes packet loss. Newer NICs support the interrupt coalescing feature and allow the driver to automatically adjust the number of hardware interrupts. You can execute `ethtool -c ${NIC_DEV_NAME}` to check and `ethtool -C ${NIC_DEV_NAME}` to enable this feature. The adaptive mode allows the NIC to automatically adjust the interrupt coalescing. In this mode, the driver checks the traffic mode and kernel receiving mode, and evaluates the coalescing settings in real time to prevent packet loss. NICs of different brands have different features and default configurations. For details, see the NIC manuals. - Adapter queue: Before processing the protocol stack, the kernel uses this queue to buffer the data received by the NIC, and each CPU has its own backlog queue. The maximum number of packets that can be cached in this queue is `netdev\_max\_backlog`. Observe the second column of `/proc/net/softnet\_stat`. 
When the second column of a row continues to increase, it means that the CPU [row-1] queue is full and the data packet is lost. To resolve this problem, continue to double the `net.core.netdev \_max\_backlog` value. - Send queue: The length value of a send queue determines the number of packets that can be queued before sending. The default value is `1000`, which is sufficient for 10 Gbps. - Driver: NIC drivers usually provide tuning parameters. See the device hardware manual and its driver documentation. From 5246b4ebd3b32681e2c206cb449bc3b268fe0aa2 Mon Sep 17 00:00:00 2001 From: TomShawn <41534398+TomShawn@users.noreply.github.com> Date: Tue, 30 Jun 2020 20:15:57 +0800 Subject: [PATCH 3/8] Apply suggestions from code review Co-authored-by: Lilian Lee --- tune-operating-system.md | 34 +++++++++++++++++----------------- 1 file changed, 17 insertions(+), 17 deletions(-) diff --git a/tune-operating-system.md b/tune-operating-system.md index 6045597c19d0e..cda4d9751bf2b 100644 --- a/tune-operating-system.md +++ b/tune-operating-system.md @@ -1,7 +1,7 @@ --- title: Operating System Tuning summary: Learn how to tune the parameters of the operating system. -category: reference +category: tuning --- # Operating System Tuning @@ -10,16 +10,16 @@ This document introduces how to tune each subsystem of CentOS 7. > **Note:** > -> + The default configuration of the CentOS 7 operating system is suitable for most services running under moderate workload. Adjusting the performance of a particular subsystem might negatively affects other subsystems. Therefore, before tuning the system, back up all user data and configuration information; -> + Fully test all the changes in the test environment before applying to the production environment. +> + The default configuration of the CentOS 7 operating system is suitable for most services running under moderate workloads. Adjusting the performance of a particular subsystem might negatively affects other subsystems. Therefore, before tuning the system, back up all the user data and configuration information. +> + Fully test all the changes in the test environment before applying them to the production environment. ## Performance analysis methods -System tuning must be based on the results of system performance analysis, so this document first lists common methods for performance analysis. +System tuning must be based on the results of system performance analysis. This section lists common methods for performance analysis. ### In 60 seconds -[*Linux Performance Analysis in 60,000 Milliseconds*](http://www.brendangregg.com/Articles/Netflix_Linux_Perf_Analysis_60s.pdf) is published by the author Brendan Gregg and the Netflix Performance Engineering team. All used tools can be obtained from the official release of Linux. You can analyze outputs of the following list items to troubleshoot most common performance issues. +[*Linux Performance Analysis in 60,000 Milliseconds*](http://www.brendangregg.com/Articles/Netflix_Linux_Perf_Analysis_60s.pdf) is published by the author Brendan Gregg and the Netflix Performance Engineering team. All tools used can be obtained from the official release of Linux. You can analyze outputs of the following list items to troubleshoot most common performance issues. + `uptime` + `dmesg | tail` @@ -46,36 +46,36 @@ Starting from CentOS 7.6, the Linux kernel has supported Berkeley Packet Filter This section introduces performance tuning based on the classified kernel subsystems. 
-### CPU -- frequency scaling +### CPU—frequency scaling -cpufreq is a module that dynamically adjusts the CPU frequency and supports five modes. To ensure service performance, select the performance mode and fix the CPU frequency at the highest supported operating frequency without dynamic adjustment. The command for this operation is `cpupower frequency-set --governor performance`. +cpufreq is a module that dynamically adjusts the CPU frequency. It supports five modes. To ensure service performance, select the performance mode and fix the CPU frequency at the highest supported operating frequency without dynamic adjustment. The command for this operation is `cpupower frequency-set --governor performance`. -### CPU -- interrupt affinity +### CPU—interrupt affinity - Automatic balance can be implemented through the `irqbalance` service. - Manual balance: - Identify the devices that need to balance interrupts. Starting from CentOS 7.5, the system automatically configures the best interrupt affinity for certain devices and their drivers, such as devices that use the `be2iscsi` driver and NVMe settings. You can no longer manually configure interrupt affinity for such devices. - For other devices, check the chip manual to see whether these devices support distributing interrupts. - If they do not, all interrupts of these devices are routed to the same CPU and cannot be modified. - - If they do, calculate the `smp_affinity` mask and set the corresponding configuration file. For details, see [kernel document](https://www.kernel.org/doc/Documentation/IRQ-affinity.txt). + - If they do, calculate the `smp_affinity` mask and set the corresponding configuration file. For details, see the [kernel document](https://www.kernel.org/doc/Documentation/IRQ-affinity.txt). ### NUMA CPU binding To avoid accessing memory across Non-Uniform Memory Access (NUMA) nodes as much as possible, you can bind a thread/process to certain CPU cores by setting the CPU affinity of the thread. For ordinary programs, you can use the `numactl` command for the CPU binding. For detailed usage, see the Linux manual pages. For network interface card (NIC) interrupts, see [tune network](#tune-network). -### Memory -- transparent huge page (THP) +### Memory—transparent huge page (THP) -It is **NOT** recommended to use THP for database applications, because databases often have sparse rather than continuous memory access patterns. If high-level memory fragmentation is serious, a higher latency will occur when THP pages are allocated. If the direct compaction is enabled for THP, the CPU usage will surge. Therefore, it is recommended to disable the direct compaction for THP. +It is **NOT** recommended to use THP for database applications, because databases often have sparse rather than continuous memory access patterns. If high-level memory fragmentation is serious, a higher latency will occur when THP pages are allocated. If the direct compaction is enabled for THP, the CPU usage will surge. Therefore, it is recommended to disable THP. -``` sh +```sh echo never > /sys/kernel/mm/transparent_hugepage/enabled echo never > /sys/kernel/mm/transparent_hugepage/defrag ``` -### Memory -- virtual memory parameters +### Memory—virtual memory parameters - `dirty_ratio` percentage ratio. When the total amount of dirty page caches reach this percentage ratio of the total system memory, the system starts to use the `pdflush` operation to write the dirty page caches to disk. 
The default value of `dirty_ratio` is 20% and usually does not need adjustment. For high-performance SSDs such as NVMe devices, lowering this value helps improve the efficiency of memory reclamation. -- `dirty_background_ratio` percentage ratio. When the total amount of dirty page caches reach this percentage ratio of the total system memory, the system starts to write the dirty page caches to disk in the background. The default value of `dirty_ratio` is 10% and usually does not need adjustment. For high-performance SSDs such as NVMe devices, lower value helps improve the efficiency of memory reclamation. +- `dirty_background_ratio` percentage ratio. When the total amount of dirty page caches reach this percentage ratio of the total system memory, the system starts to write the dirty page caches to the disk in the background. The default value of `dirty_ratio` is 10% and usually does not need adjustment. For high-performance SSDs such as NVMe devices, setting a lower value helps improve the efficiency of memory reclamation. ### Storage and file system @@ -89,13 +89,13 @@ The I/O scheduler determines when and how long I/O operations run on the storage echo noop > /sys/block/${SSD_DEV_NAME}/queue/scheduler ``` -#### Formatting parameters -- block size +#### Formatting parameters—block size Blocks are the working units of the file system. The block size determines how much data can be stored in a single block, and thus determines the minimum amount of data to be written or read each time. -The default block size is suitable for most scenarios. However, if the block size (or the size of multiple blocks) is the same or slightly larger than the amount of data normally read or written each time, the file system performs better and the data storage efficiency is higher. Small files still uses the entire block. Files can be distributed among multiple blocks, but this will increase runtime overhead. +The default block size is suitable for most scenarios. However, if the block size (or the size of multiple blocks) is the same or slightly larger than the amount of data normally read or written each time, the file system performs better and the data storage efficiency is higher. Small files still uses the entire block. Files can be distributed among multiple blocks, but this will increase runtime overhead. -When using the `mkfs` command to format a device, specify the block size as a part of the file system options. The parameters that specify the block size vary with the file system. For details, see the corresponding mkfs manual pages. +When using the `mkfs` command to format a device, specify the block size as a part of the file system options. The parameters that specify the block size vary with the file system. For details, see the corresponding `mkfs` manual pages, such as using `man mkfs.ext4`. #### `mount` parameters From f57e986e8de1d11e5d791017ce530e0542a358ff Mon Sep 17 00:00:00 2001 From: TomShawn <41534398+TomShawn@users.noreply.github.com> Date: Tue, 30 Jun 2020 20:18:26 +0800 Subject: [PATCH 4/8] Update tune-operating-system.md --- tune-operating-system.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/tune-operating-system.md b/tune-operating-system.md index cda4d9751bf2b..f0dba6f04dfc0 100644 --- a/tune-operating-system.md +++ b/tune-operating-system.md @@ -110,7 +110,7 @@ Network issues are usually caused by issues of hardware or related devices. 
So b Although the network stack is largely self-optimizing, the following aspects in the network packet processing might become the bottleneck and reduce performance: - NIC hardware cache: To correctly observe the packet loss at the hardware level, use the `ethtool -S ${NIC_DEV_NAME}` command to observe the `drops` field. When packet loss occurs, it might be that the processing speed of the hard/soft interrupts cannot follow the receiving speed of NIC. If the received buffer size is less than the upper limit, you can also try to increase the RX buffer to avoid packet loss. The query command is: `ethtool -g ${NIC_DEV_NAME}`, and the modification command is `ethtool -G ${NIC_DEV_NAME}`. -- Hardware interrupts: If the NIC supports the Receive-Side Scaling (RSS, also called multi-NIC receiving) feature, observe the `/proc/interrupts` NIC interrupts. If the interrupts are uneven, see [CPU -- scale frequency](#cpu----frequency-scaling), [CPU -- interrupt affinity](#cpu----interrupt-affinity), and [NUMA CPU binding](#numa-cpu-binding). If the NIC does not support RSS or the number of RSS is much smaller than the number of physical CPU cores, you can configure Receive Packet Steering (RPS, which can be regarded as the software implementation of RSS), and the RPS extension Receive Flow Steering (RFS). For detailed configuration, see [kernel document](https://www.kernel.org/doc/Documentation/networking/scaling.txt) +- Hardware interrupts: If the NIC supports the Receive-Side Scaling (RSS, also called multi-NIC receiving) feature, observe the `/proc/interrupts` NIC interrupts. If the interrupts are uneven, see [CPU—scale frequency](#cpufrequency-scaling), [CPU—interrupt affinity](#cpuinterrupt-affinity), and [NUMA CPU binding](#numa-cpu-binding). If the NIC does not support RSS or the number of RSS is much smaller than the number of physical CPU cores, you can configure Receive Packet Steering (RPS, which can be regarded as the software implementation of RSS), and the RPS extension Receive Flow Steering (RFS). For detailed configuration, see [kernel document](https://www.kernel.org/doc/Documentation/networking/scaling.txt) - Software interrupts: Observe the monitoring of `/proc/net/softnet\_stat`. If the values of the other columns except the third column are increasing, properly adjust the value of `net.core.netdev\_budget` or `net.core.dev\_weight` for `softirq` to get more CPU time. In addition, you also need to check the CPU usage to determine which tasks are frequently using the CPU and whether they can be optimized. - Receive queue of application sockets: Monitor the `Resv-q` column of `ss -nmp`. If the queue is full, consider increasing the size of the application socket cache or use the automatic cache adjustment method. In addition, consider whether you can optimize the architecture of the application layer and reduce the interval between reading sockets. - Ethernet flow control: If the NIC and switch support the flow control feature, you can use this feature to give kernel some time to process the data in the NIC queue and to avoid the issue of NIC buffer overflow. 
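As a concrete illustration of the NIC hardware cache item above, the following sketch (assuming the device name is stored in `NIC_DEV_NAME`, and using `4096` only as an example ring size) checks the driver drop counters and the RX ring limits before enlarging the buffer. The exact counter names printed by `ethtool -S` vary by driver.

```sh
# List driver-level drop counters (field names differ between NIC drivers).
ethtool -S ${NIC_DEV_NAME} | grep -i drop

# Compare the current RX/TX ring sizes with the pre-set hardware maximums.
ethtool -g ${NIC_DEV_NAME}

# Enlarge the RX ring toward the hardware maximum; 4096 is only a placeholder value.
ethtool -G ${NIC_DEV_NAME} rx 4096
```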
From 44d9a7f3210163295a172787a5184e3ac00015b0 Mon Sep 17 00:00:00 2001 From: TomShawn <41534398+TomShawn@users.noreply.github.com> Date: Wed, 1 Jul 2020 15:37:34 +0800 Subject: [PATCH 5/8] Update tune-operating-system.md Co-authored-by: Lilian Lee --- tune-operating-system.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/tune-operating-system.md b/tune-operating-system.md index f0dba6f04dfc0..de65f49d5d7c5 100644 --- a/tune-operating-system.md +++ b/tune-operating-system.md @@ -32,7 +32,7 @@ System tuning must be based on the results of system performance analysis. This + `sar -n TCP,ETCP 1` + `top` -For detailed usage, see the Linux manual pages. +For detailed usage, see the corresponding `man` instructions. ### perf From 689e5b8c2c6603ff574ffff26d96b6b50a35d872 Mon Sep 17 00:00:00 2001 From: TomShawn <41534398+TomShawn@users.noreply.github.com> Date: Wed, 1 Jul 2020 15:39:11 +0800 Subject: [PATCH 6/8] Update tune-operating-system.md --- tune-operating-system.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/tune-operating-system.md b/tune-operating-system.md index de65f49d5d7c5..11efb667b94c7 100644 --- a/tune-operating-system.md +++ b/tune-operating-system.md @@ -93,7 +93,7 @@ echo noop > /sys/block/${SSD_DEV_NAME}/queue/scheduler Blocks are the working units of the file system. The block size determines how much data can be stored in a single block, and thus determines the minimum amount of data to be written or read each time. -The default block size is suitable for most scenarios. However, if the block size (or the size of multiple blocks) is the same or slightly larger than the amount of data normally read or written each time, the file system performs better and the data storage efficiency is higher. Small files still uses the entire block. Files can be distributed among multiple blocks, but this will increase runtime overhead. +The default block size is suitable for most scenarios. However, if the block size (or the size of multiple blocks) is the same or slightly larger than the amount of data normally read or written each time, the file system performs better and the data storage efficiency is higher. Small files still use the entire block. Files can be distributed among multiple blocks, but this will increase runtime overhead. When using the `mkfs` command to format a device, specify the block size as a part of the file system options. The parameters that specify the block size vary with the file system. For details, see the corresponding `mkfs` manual pages, such as using `man mkfs.ext4`. From fc7f0684cd1af4e446908cf74d6a51357c5614a1 Mon Sep 17 00:00:00 2001 From: TomShawn <41534398+TomShawn@users.noreply.github.com> Date: Thu, 2 Jul 2020 16:52:14 +0800 Subject: [PATCH 7/8] address comments from lilian Co-authored-by: Lilian Lee --- tune-operating-system.md | 14 +++++++------- 1 file changed, 7 insertions(+), 7 deletions(-) diff --git a/tune-operating-system.md b/tune-operating-system.md index 11efb667b94c7..512a1b72d6ae8 100644 --- a/tune-operating-system.md +++ b/tune-operating-system.md @@ -107,14 +107,14 @@ The network subsystem consists of many different parts with sensitive connection Network issues are usually caused by issues of hardware or related devices. So before tuning the protocol stack, rule out hardware issues. 
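For example, a minimal hardware sanity check (assuming the device name is stored in `NIC_DEV_NAME`) can confirm the negotiated link state and the low-level error counters before any protocol stack tuning:

```sh
# Confirm that the link is up and negotiated at the expected speed and duplex.
ethtool ${NIC_DEV_NAME}

# Check the RX/TX error and drop counters reported for the interface.
ip -s link show dev ${NIC_DEV_NAME}
```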
-Although the network stack is largely self-optimizing, the following aspects in the network packet processing might become the bottleneck and reduce performance: +Although the network stack is largely self-optimizing, the following aspects in the network packet processing might become the bottleneck and affect performance: -- NIC hardware cache: To correctly observe the packet loss at the hardware level, use the `ethtool -S ${NIC_DEV_NAME}` command to observe the `drops` field. When packet loss occurs, it might be that the processing speed of the hard/soft interrupts cannot follow the receiving speed of NIC. If the received buffer size is less than the upper limit, you can also try to increase the RX buffer to avoid packet loss. The query command is: `ethtool -g ${NIC_DEV_NAME}`, and the modification command is `ethtool -G ${NIC_DEV_NAME}`. -- Hardware interrupts: If the NIC supports the Receive-Side Scaling (RSS, also called multi-NIC receiving) feature, observe the `/proc/interrupts` NIC interrupts. If the interrupts are uneven, see [CPU—scale frequency](#cpufrequency-scaling), [CPU—interrupt affinity](#cpuinterrupt-affinity), and [NUMA CPU binding](#numa-cpu-binding). If the NIC does not support RSS or the number of RSS is much smaller than the number of physical CPU cores, you can configure Receive Packet Steering (RPS, which can be regarded as the software implementation of RSS), and the RPS extension Receive Flow Steering (RFS). For detailed configuration, see [kernel document](https://www.kernel.org/doc/Documentation/networking/scaling.txt) -- Software interrupts: Observe the monitoring of `/proc/net/softnet\_stat`. If the values of the other columns except the third column are increasing, properly adjust the value of `net.core.netdev\_budget` or `net.core.dev\_weight` for `softirq` to get more CPU time. In addition, you also need to check the CPU usage to determine which tasks are frequently using the CPU and whether they can be optimized. +- NIC hardware cache: To correctly observe the packet loss at the hardware level, use the `ethtool -S ${NIC_DEV_NAME}` command to observe the `drops` field. When packet loss occurs, it might be that the processing speed of the hard/soft interrupts cannot catch up with the receiving speed of NIC. If the received buffer size is less than the upper limit, you can also try to increase the RX buffer to avoid packet loss. The query command is: `ethtool -g ${NIC_DEV_NAME}`, and the modification command is `ethtool -G ${NIC_DEV_NAME}`. +- Hardware interrupts: If the NIC supports the Receive-Side Scaling (RSS, also called multi-NIC receiving) feature, observe the `/proc/interrupts` NIC interrupts. If the interrupts are uneven, see [CPU—frequency scaling](#cpufrequency-scaling), [CPU—interrupt affinity](#cpuinterrupt-affinity), and [NUMA CPU binding](#numa-cpu-binding). If the NIC does not support RSS or the number of RSS is much smaller than the number of physical CPU cores, configure Receive Packet Steering (RPS, which can be regarded as the software implementation of RSS), and the RPS extension Receive Flow Steering (RFS). For detailed configuration, see the [kernel document](https://www.kernel.org/doc/Documentation/networking/scaling.txt). +- Software interrupts: Observe the monitoring of `/proc/net/softnet_stat`. If the values of the other columns except the third column are increasing, properly adjust the value of `net.core.netdev_budget` or `net.core.dev_weight` for `softirq` to get more CPU time. 
In addition, you also need to check the CPU usage to determine which tasks are frequently using the CPU and whether they can be optimized. - Receive queue of application sockets: Monitor the `Resv-q` column of `ss -nmp`. If the queue is full, consider increasing the size of the application socket cache or use the automatic cache adjustment method. In addition, consider whether you can optimize the architecture of the application layer and reduce the interval between reading sockets. -- Ethernet flow control: If the NIC and switch support the flow control feature, you can use this feature to give kernel some time to process the data in the NIC queue and to avoid the issue of NIC buffer overflow. +- Ethernet flow control: If the NIC and switch support the flow control feature, you can use this feature to leave some time for the kernel to process the data in the NIC queue, to avoid the issue of NIC buffer overflow. - Interrupts coalescing: Too frequent hardware interrupts reduces system performance, and too late hardware interrupts causes packet loss. Newer NICs support the interrupt coalescing feature and allow the driver to automatically adjust the number of hardware interrupts. You can execute `ethtool -c ${NIC_DEV_NAME}` to check and `ethtool -C ${NIC_DEV_NAME}` to enable this feature. The adaptive mode allows the NIC to automatically adjust the interrupt coalescing. In this mode, the driver checks the traffic mode and kernel receiving mode, and evaluates the coalescing settings in real time to prevent packet loss. NICs of different brands have different features and default configurations. For details, see the NIC manuals. -- Adapter queue: Before processing the protocol stack, the kernel uses this queue to buffer the data received by the NIC, and each CPU has its own backlog queue. The maximum number of packets that can be cached in this queue is `netdev\_max\_backlog`. Observe the second column of `/proc/net/softnet\_stat`. When the second column of a row continues to increase, it means that the CPU [row-1] queue is full and the data packet is lost. To resolve this problem, continue to double the `net.core.netdev \_max\_backlog` value. -- Send queue: The length value of a send queue determines the number of packets that can be queued before sending. The default value is `1000`, which is sufficient for 10 Gbps. +- Adapter queue: Before processing the protocol stack, the kernel uses this queue to buffer the data received by the NIC, and each CPU has its own backlog queue. The maximum number of packets that can be cached in this queue is `netdev_max_backlog`. Observe the second column of `/proc/net/softnet_stat`. When the second column of a row continues to increase, it means that the CPU [row-1] queue is full and the data packet is lost. To resolve this problem, continue to double the `net.core.netdev_max_backlog` value. +- Send queue: The length value of a send queue determines the number of packets that can be queued before sending. The default value is `1000`, which is sufficient for 10 Gbps. But if you have observed the value of TX errors from the output of `ip -s link`, you can try to double it: `ip link set dev ${NIC_DEV_NAME} txqueuelen 2000`. - Driver: NIC drivers usually provide tuning parameters. See the device hardware manual and its driver documentation. 
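The software interrupt and adapter queue items above can be combined into a short observation-and-adjustment sketch. The values below are placeholders rather than recommendations; derive the real ones from your own `/proc/net/softnet_stat` readings.

```sh
# Per-CPU softnet statistics: the second column counts backlog drops and
# the third column counts times the softirq budget or time slice ran out.
cat /proc/net/softnet_stat

# Inspect the current softirq budget and per-device weight.
sysctl net.core.netdev_budget net.core.dev_weight

# Example adjustments (placeholder values); persist them under /etc/sysctl.d/ if they help.
sysctl -w net.core.netdev_budget=600
sysctl -w net.core.netdev_max_backlog=2000
```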
From 38ca442329a227256de67a566bba782c44897d6c Mon Sep 17 00:00:00 2001 From: Lilian Lee Date: Thu, 2 Jul 2020 17:01:07 +0800 Subject: [PATCH 8/8] Add several blank lines to make it not so crowded --- tune-operating-system.md | 8 ++++++++ 1 file changed, 8 insertions(+) diff --git a/tune-operating-system.md b/tune-operating-system.md index 512a1b72d6ae8..ba4a24e6ce497 100644 --- a/tune-operating-system.md +++ b/tune-operating-system.md @@ -110,11 +110,19 @@ Network issues are usually caused by issues of hardware or related devices. So b Although the network stack is largely self-optimizing, the following aspects in the network packet processing might become the bottleneck and affect performance: - NIC hardware cache: To correctly observe the packet loss at the hardware level, use the `ethtool -S ${NIC_DEV_NAME}` command to observe the `drops` field. When packet loss occurs, it might be that the processing speed of the hard/soft interrupts cannot catch up with the receiving speed of NIC. If the received buffer size is less than the upper limit, you can also try to increase the RX buffer to avoid packet loss. The query command is: `ethtool -g ${NIC_DEV_NAME}`, and the modification command is `ethtool -G ${NIC_DEV_NAME}`. + - Hardware interrupts: If the NIC supports the Receive-Side Scaling (RSS, also called multi-NIC receiving) feature, observe the `/proc/interrupts` NIC interrupts. If the interrupts are uneven, see [CPU—frequency scaling](#cpufrequency-scaling), [CPU—interrupt affinity](#cpuinterrupt-affinity), and [NUMA CPU binding](#numa-cpu-binding). If the NIC does not support RSS or the number of RSS is much smaller than the number of physical CPU cores, configure Receive Packet Steering (RPS, which can be regarded as the software implementation of RSS), and the RPS extension Receive Flow Steering (RFS). For detailed configuration, see the [kernel document](https://www.kernel.org/doc/Documentation/networking/scaling.txt). + - Software interrupts: Observe the monitoring of `/proc/net/softnet_stat`. If the values of the other columns except the third column are increasing, properly adjust the value of `net.core.netdev_budget` or `net.core.dev_weight` for `softirq` to get more CPU time. In addition, you also need to check the CPU usage to determine which tasks are frequently using the CPU and whether they can be optimized. + - Receive queue of application sockets: Monitor the `Resv-q` column of `ss -nmp`. If the queue is full, consider increasing the size of the application socket cache or use the automatic cache adjustment method. In addition, consider whether you can optimize the architecture of the application layer and reduce the interval between reading sockets. + - Ethernet flow control: If the NIC and switch support the flow control feature, you can use this feature to leave some time for the kernel to process the data in the NIC queue, to avoid the issue of NIC buffer overflow. + - Interrupts coalescing: Too frequent hardware interrupts reduces system performance, and too late hardware interrupts causes packet loss. Newer NICs support the interrupt coalescing feature and allow the driver to automatically adjust the number of hardware interrupts. You can execute `ethtool -c ${NIC_DEV_NAME}` to check and `ethtool -C ${NIC_DEV_NAME}` to enable this feature. The adaptive mode allows the NIC to automatically adjust the interrupt coalescing. 
In this mode, the driver checks the traffic mode and the kernel receiving mode, and evaluates the coalescing settings in real time to prevent packet loss. NICs of different brands have different features and default configurations. For details, see the NIC manuals.
+
- Adapter queue: Before processing the protocol stack, the kernel uses this queue to buffer the data received by the NIC, and each CPU has its own backlog queue. The maximum number of packets that can be cached in this queue is `netdev_max_backlog`. Observe the second column of `/proc/net/softnet_stat`. If the second column of a row keeps increasing, the backlog queue of the corresponding CPU (row number minus one) is full and packets are being dropped. To resolve this problem, keep doubling the value of `net.core.netdev_max_backlog`.
+
- Send queue: The length of a send queue determines the number of packets that can be queued before sending. The default value is `1000`, which is sufficient for 10 Gbps networks. However, if you observe TX errors in the output of `ip -s link`, you can try doubling the queue length: `ip link set dev ${NIC_DEV_NAME} txqueuelen 2000`.
+
- Driver: NIC drivers usually provide tuning parameters. See the device hardware manual and its driver documentation.
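As a closing sketch for the interrupt coalescing and send queue items above (assuming the device name is stored in `NIC_DEV_NAME`; not every driver exposes the adaptive settings), you might check and adjust them as follows:

```sh
# Show the current interrupt coalescing settings.
ethtool -c ${NIC_DEV_NAME}

# Let the driver adapt coalescing to the traffic pattern, if the NIC supports it.
ethtool -C ${NIC_DEV_NAME} adaptive-rx on adaptive-tx on

# Check TX errors and, if needed, enlarge the send queue (2000 is only an example length).
ip -s link show dev ${NIC_DEV_NAME}
ip link set dev ${NIC_DEV_NAME} txqueuelen 2000
```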