process and threads
huataihuang committed Aug 7, 2023
1 parent cb6cc04 commit cdfa394
Showing 19 changed files with 224 additions and 2 deletions.
47 changes: 47 additions & 0 deletions source/kernel/process/process_vs_thread.rst
@@ -44,10 +44,57 @@ As Richard Stevens put it (paraphrased):
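- Threads are sometimes called lightweight processes; creating a thread is 10 to 100 times faster than creating a process
- All threads of a process share the same global memory, which makes sharing information between threads easy, but that simplicity comes at the cost of synchronization problems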

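Viewing the threads of a process
================================

- The file names under ``/proc/<PID>/task/`` are the thread IDs (``TID``) of the process. For example, to inspect the threads of a ``java`` process:

.. literalinclude:: process_vs_thread/get_thread_tid
   :caption: Getting thread ``TID`` values from the file names under ``/proc/<PID>/task/``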
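
The same lookup can be wrapped in a small function. This is a minimal sketch: the name ``list_tids`` is hypothetical, and the example inspects the current shell (``$$``) so that it runs on any Linux system with ``/proc`` mounted; substitute the PID of the process you care about.

```shell
#!/usr/bin/env bash
# list_tids is a hypothetical helper name, not part of any standard tool.
# Print one thread ID (TID) per line for the given PID, in numeric order.
list_tids() {
    local pid="$1"
    # Each entry under /proc/<pid>/task is a directory named after a TID.
    ls "/proc/${pid}/task" | sort -n
}

# Inspect the current shell. The main thread's TID is equal to the PID,
# so the PID itself always appears in the list.
list_tids "$$"
```

For a multi-threaded process the function prints one line per thread, so piping it to ``wc -l`` gives the thread count.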

- The ``ps`` options ``-eLf`` list every thread on the system:

.. literalinclude:: process_vs_thread/ps_thread
   :caption: ``ps -eLf`` lists all threads

The options mean:

- ``-L`` shows threads, adding the ``LWP`` and ``NLWP`` columns
- ``-e`` selects all processes (every process on the system, not only those of the current user)
- ``-f`` uses the ``full-format`` listing

The output looks like this:

.. literalinclude:: process_vs_thread/ps_thread_output
   :caption: Sample output of ``ps -eLf``
   :emphasize-lines: 8,9

Here:

- Column 4, ``LWP``, is the Light Weight Process ID, that is, the thread ``TID``
- Column 6, ``NLWP``, is the number of threads in the process
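
To make the column positions concrete, the sketch below extracts columns 2, 4 and 6 with ``awk``. It works on two rows copied from the sample output (with ``CMD`` shortened to its first word) rather than on live ``ps`` output, so the variable names ``sample`` and ``summary`` are illustrative only.

```shell
#!/usr/bin/env bash
# Two rows copied from the sample ps -eLf output (CMD shortened to one word).
sample='grafana 2001 1 2001 0 57 Aug06 ? 00:00:00 /usr/share/grafana/bin/grafana
grafana 2001 1 2389 0 57 Aug06 ? 00:00:10 /usr/share/grafana/bin/grafana'

# Column 2 is PID, column 4 is LWP (the TID), column 6 is NLWP (thread count).
summary="$(printf '%s\n' "$sample" | awk '{ print "pid=" $2, "tid=" $4, "threads=" $6 }')"
printf '%s\n' "$summary"
```

Both rows report the same ``pid`` and ``threads`` values because they are two threads of one process; only the ``tid`` differs. On a live system the same ``awk`` program can be fed from ``ps -eLf`` after skipping the header row.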

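- With the ``pstree`` command:

.. literalinclude:: process_vs_thread/pstree_thread
   :caption: ``pstree`` shows all threads of a given PID

Sample output (for :ref:`grafana` ); thread names appear in curly braces:

.. literalinclude:: process_vs_thread/pstree_thread_output
   :caption: ``pstree`` showing all threads of the grafana process

- ``top`` can display and sort by thread count, see :ref:`top_nth`

.. note::

   If the number of threads on a system looks abnormal, check :ref:`thread_count` and look for a thread leak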

References
==========

- `Linux多线程编程(不限Linux) <http://www.cnblogs.com/skynet/archive/2010/10/30/1865267.html>`_
- `Linux 线程实现机制分析 <https://www.ibm.com/developerworks/cn/linux/kernel/l-thread/>`_
- `Linux 线程模型的比较:LinuxThreads 和 NPTL <http://www.ibm.com/developerworks/cn/linux/l-threading.html>`_
- `知乎:Linux中进程和线程的开销基本一样啊,为什么还要多线程呢? <https://www.zhihu.com/question/19903801>`_
- `Solved: Check thread count per process in Linux [5 Methods] <https://www.golinuxcloud.com/check-threads-per-process-count-processes/>`_
1 change: 1 addition & 0 deletions source/kernel/process/process_vs_thread/get_thread_tid
@@ -0,0 +1 @@
ls /proc/$(pidof -s java)/task/
1 change: 1 addition & 0 deletions source/kernel/process/process_vs_thread/ps_thread
@@ -0,0 +1 @@
ps -eLf | less
10 changes: 10 additions & 0 deletions source/kernel/process/process_vs_thread/ps_thread_output
@@ -0,0 +1,10 @@
UID PID PPID LWP C NLWP STIME TTY TIME CMD
root 1 0 1 0 1 Aug06 ? 00:00:13 /sbin/init
root 2 0 2 0 1 Aug06 ? 00:00:00 [kthreadd]
root 3 2 3 0 1 Aug06 ? 00:00:00 [rcu_gp]
...
root 1984 2 1984 0 1 Aug06 ? 00:00:00 [nvidia]
root 1985 2 1985 0 1 Aug06 ? 00:00:03 [nv_queue]
grafana 2001 1 2001 0 57 Aug06 ? 00:00:00 /usr/share/grafana/bin/grafana server --config=/etc/grafana/grafana.ini --pidfile=/run/grafan
grafana 2001 1 2389 0 57 Aug06 ? 00:00:10 /usr/share/grafana/bin/grafana server --config=/etc/grafana/grafana.ini --pidfile=/run/grafan
...
2 changes: 2 additions & 0 deletions source/kernel/process/process_vs_thread/pstree_thread
@@ -0,0 +1,2 @@
# Inspect the threads of the grafana process
pstree -pau -l -g -s 2001
16 changes: 16 additions & 0 deletions source/kernel/process/process_vs_thread/pstree_thread_output
@@ -0,0 +1,16 @@
systemd,1,1
  └─grafana,2001,2001,grafana server --config=/etc/grafana/grafana.ini --pidfile=/run/grafana/grafana-server.pid --packaging=deb cfg:default.paths.logs=/var/log/grafana cfg:default.paths.data=/var/lib/grafana cfg:default.paths.plugins=/var/lib/grafana/plugins cfg:default.paths.provisioning=/etc/grafana/provisioning
      ├─pcp_redis_datas,4012,2001
      │   ├─{pcp_redis_datas},4013,2001
      │   ├─{pcp_redis_datas},4014,2001
      │   ├─{pcp_redis_datas},4015,2001
      │   ├─{pcp_redis_datas},4016,2001
      │   ├─{pcp_redis_datas},4018,2001
      │   ├─{pcp_redis_datas},4019,2001
      │   ├─{pcp_redis_datas},4020,2001
      │   └─{pcp_redis_datas},4021,2001
      ├─{grafana},2389,2001
      ├─{grafana},2390,2001
      ├─{grafana},2391,2001
      ├─{grafana},2394,2001
      ...
61 changes: 61 additions & 0 deletions source/kernel/process/thread_count.rst
@@ -41,6 +41,67 @@
.. literalinclude:: thread_count/linux_os_threads_number
   :caption: Getting the total number of threads on a Linux system

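Troubleshooting an excessive thread count
=========================================

When a production system has too many threads, find out which process is creating them:

- Find the process that uses the most threads:

.. literalinclude:: thread_count/linux_os_max_threads_process
   :caption: Finding the process that uses the most threads on Linux

.. literalinclude:: thread_count/linux_os_max_threads_process_output
   :caption: Output: the process that uses the most threads on Linux

The command is imperfect: ``command`` can contain spaces, so ``sort -k3`` does not always sort on the ``nlwp`` column (see the ``sleep 86400`` row). Even so, it shows that process ``378199`` has far too many threads. The same process also stands out in :ref:`top_nth` :

.. literalinclude:: utils/top/top_nth_output_too_large
   :caption: The ``nTH`` field of ``top`` cannot display more than 3 digits
   :emphasize-lines: 8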
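
One way to avoid the sort-key ambiguity is to put the thread count in the first column. The sketch below does this by reading ``/proc/<PID>/status`` directly instead of calling ``ps``; the function name ``top_thread_procs`` is hypothetical, and only the first word of each process name is printed.

```shell
#!/usr/bin/env bash
# top_thread_procs is a hypothetical helper name.
# Print "threads pid name" for the N processes with the most threads.
top_thread_procs() {
    local n="${1:-5}"
    local status
    for status in /proc/[0-9]*/status; do
        # A process may exit between the glob and the read; ignore such errors.
        awk '/^Name:/    { name = $2 }
             /^Pid:/     { pid = $2 }
             /^Threads:/ { threads = $2 }
             END { if (threads != "") print threads, pid, name }' "$status" 2>/dev/null
    done | sort -k1,1nr | head -n "$n"
}

# Show the five processes with the highest thread counts.
top_thread_procs 5
```

Because the numeric key is always field 1, spaces in a process name can no longer shift it. With ``ps``, the equivalent idea is to list ``nlwp`` first, for example ``ps -eo nlwp,pid,comm | sort -n``.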

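Limits on the number of threads and processes
=============================================

- The kernel's system-wide limit on the number of threads that can be created with ``clone()`` is available from ``procfs``:

.. literalinclude:: thread_count/kernel_threads-max
   :caption: Reading the system-wide thread limit from ``procfs``

On my :ref:`ubuntu_linux` 22.04 system the default limit is about 3.09 million threads. Note that ``threads-max`` applies to the whole system, not to each process:

.. literalinclude:: thread_count/kernel_threads-max_output
   :caption: ``procfs`` shows a system-wide limit of about 3.09 million threads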
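
To judge how close a system is to this limit, the current number of threads can be read from the fourth field of ``/proc/loadavg``, which has the form ``running/total`` and counts all existing scheduling entities (processes and threads). A minimal sketch, in which the variable names are illustrative:

```shell
#!/usr/bin/env bash
# System-wide thread limit.
threads_max="$(cat /proc/sys/kernel/threads-max)"

# Field 4 of /proc/loadavg is "running/total"; keep the part after the slash.
read -r _ _ _ entities _ < /proc/loadavg
threads_now="${entities#*/}"

echo "threads in use: ${threads_now} of ${threads_max}"
```

A count that climbs steadily toward ``threads-max`` is a reason to look for the process responsible.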

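- In addition, ``ulimit`` shows how many processes each user may run:

.. literalinclude:: thread_count/ulimits_processes
   :caption: ``ulimit -a`` shows the maximum number of processes for the current user

The per-user limit is exactly half of ``threads-max`` (rounded down), about 1.55 million. On Linux this limit counts threads as well as processes:

.. literalinclude:: thread_count/ulimits_processes_output
   :caption: ``ulimit -a`` shows a per-user limit of about 1.55 million processes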
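
The relationship between the two recorded values can be checked with shell integer arithmetic. The numbers below are the ones captured in this document; the variable names are illustrative.

```shell
#!/usr/bin/env bash
threads_max=3093599          # recorded value of /proc/sys/kernel/threads-max
max_user_processes=1546799   # recorded "max user processes" from ulimit -a

# Integer division rounds down, so 3093599 / 2 gives 1546799.
half=$(( threads_max / 2 ))
echo "threads-max / 2 = ${half}"

if [ "$half" -eq "$max_user_processes" ]; then
    echo "matches ulimit -u"
fi
```

On a live system the same comparison can be made by substituting ``$(cat /proc/sys/kernel/threads-max)`` and ``$(ulimit -u)``, keeping in mind that ``ulimit -u`` may have been changed from its default.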

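.. note::

   These limits do not multiply. A single user cannot have more than about 1.55 million threads ( ``ulimit -u`` ), and the whole system cannot have more than about 3.09 million ( ``threads-max`` ). In practice a system slows down long before it reaches these limits, so when threads pile up in a process, investigate promptly and fix the underlying software bug.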

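- The limit on process IDs is also available from ``procfs``:

.. literalinclude:: thread_count/kernel_pid_max
   :caption: Reading the maximum process ID from ``procfs``

On my :ref:`ubuntu_linux` 22.04 system the default is ``4194304`` (2^22), about 4.19 million. Every thread consumes an ID from this range, so the value caps processes and threads together:

.. literalinclude:: thread_count/kernel_pid_max_output
   :caption: ``procfs`` shows a default ``pid_max`` of about 4.19 million

This parameter can be changed:

.. literalinclude:: thread_count/change_kernel_pid_max_output
   :caption: Changing the maximum with ``sysctl``, for example to about 65,000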

References
==========

@@ -0,0 +1,2 @@
echo kernel.pid_max = 65534 >> /etc/sysctl.conf
sysctl -p
1 change: 1 addition & 0 deletions source/kernel/process/thread_count/kernel_pid_max
@@ -0,0 +1 @@
cat /proc/sys/kernel/pid_max
1 change: 1 addition & 0 deletions source/kernel/process/thread_count/kernel_pid_max_output
@@ -0,0 +1 @@
4194304
1 change: 1 addition & 0 deletions source/kernel/process/thread_count/kernel_threads-max
@@ -0,0 +1 @@
cat /proc/sys/kernel/threads-max
@@ -0,0 +1 @@
3093599
@@ -0,0 +1 @@
ps -eo pid,command,nlwp | sort -n -k3
@@ -0,0 +1,5 @@
...
198370 /usr/local/bin/containerd-s 208
413302 /usr/local/bin/containerd-s 352
73275 sleep 86400 1
378199 /usr/local/bin/containerd-s 149804
1 change: 1 addition & 0 deletions source/kernel/process/thread_count/ulimits_processes
@@ -0,0 +1 @@
ulimit -a | grep -i processes
@@ -0,0 +1 @@
max user processes (-u) 1546799
42 changes: 40 additions & 2 deletions source/kernel/process/utils/top.rst
@@ -32,7 +32,44 @@ Interpreting top output

The current time ( ``22:48:55`` ), uptime ( ``up 51 days, 8:47`` ), number of logged-in users ( ``1 user`` ), and the 1-, 5- and 15-minute load averages ``10.79, 10.45, 16.44``

-
.. _top_nth:

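Checking thread counts with top
===============================

``top`` provides an ``nTH`` field that shows the number of threads of each process:

- Press ``f`` to open the ``field`` management screen
- Use the up and down arrow keys to move to ``nTH`` (Number of Threads), then press the space bar to enable it for display
- Press ``s`` to sort by ``nTH``
- Press ``q`` to leave the ``field`` management screen

``top`` then shows the thread count of every process, which makes it easy to spot a process with too many threads.

For example:

.. literalinclude:: top/top_nth_output
   :caption: Showing the thread count of each process in ``top`` with the ``nTH`` field

In practice there is one problem: when a process has more than 999 threads, the three-character ``nTH`` column cannot show the full value. For example, while investigating a production alert about an excessive thread count:

.. literalinclude:: top/top_nth_output_too_large
   :caption: The ``nTH`` field of ``top`` cannot display more than 3 digits
   :emphasize-lines: 8

So how many threads does PID ``378199`` actually have?

Count the entries under ``/proc/<PID>/task`` with ``ls``::

   $ ls /proc/378199/task | wc -l
   149642

Or read the ``Threads`` counter in the process ``status`` file::

   $ cat /proc/378199/status | grep Threads
   Threads:149705

The thread count of process ``378199`` keeps growing between checks, which suggests a thread leak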
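
Growth of this kind is easier to confirm by sampling the counter at a fixed interval. The following is a minimal sketch: ``thread_count`` and ``watch_threads`` are hypothetical helper names, and the demonstration runs against the current shell (``$$``) so that it works anywhere; in the case above the PID would be ``378199``.

```shell
#!/usr/bin/env bash
# thread_count and watch_threads are hypothetical helper names.
# Print the Threads value from /proc/<pid>/status.
thread_count() {
    awk '/^Threads:/ { print $2 }' "/proc/$1/status"
}

# Sample the thread count of a PID several times and report the change
# between the first and the last sample.
watch_threads() {
    local pid="$1" samples="${2:-3}" interval="${3:-1}"
    local first last i
    first="$(thread_count "$pid")"
    last="$first"
    for (( i = 1; i < samples; i++ )); do
        sleep "$interval"
        last="$(thread_count "$pid")"
        echo "pid=${pid} threads=${last}"
    done
    echo "delta=$(( last - first ))"
}

# Two samples, one second apart, on the current shell.
watch_threads "$$" 2 1
```

A ``delta`` that stays positive across repeated runs, with no matching drop, points toward a leak rather than a temporary burst of work.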

References
==========
@@ -42,4 +79,5 @@ Interpreting top output
- `Change top's sorting back to CPU <http://unix.stackexchange.com/questions/158584/change-tops-sorting-back-to-cpu>`_
- `12 TOP Command Examples in Linux <http://www.tecmint.com/12-top-command-examples-in-linux/>`_
- `Can You Top This? 15 Practical Linux Top Command Examples <http://www.thegeekstuff.com/2010/01/15-practical-unix-linux-top-command-examples>`_
- `top command in Linux with Examples <https://www.geeksforgeeks.org/top-command-in-linux-with-examples/>`_
- `Solved: Check thread count per process in Linux [5 Methods] <https://www.golinuxcloud.com/check-threads-per-process-count-processes/>`_
19 changes: 19 additions & 0 deletions source/kernel/process/utils/top/top_nth_output
@@ -0,0 +1,19 @@
top - 15:54:58 up 1 day, 4:53, 11 users, load average: 0.54, 0.63, 0.60
Tasks: 629 total, 1 running, 628 sleeping, 0 stopped, 0 zombie
%Cpu(s): 0.3 us, 0.2 sy, 0.0 ni, 99.6 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
MiB Mem : 386813.0 total, 329241.6 free, 52052.1 used, 5519.3 buff/cache
MiB Swap: 0.0 total, 0.0 free, 0.0 used. 332442.0 avail Mem

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND nTH
2001 grafana 20 0 5025464 144876 66116 S 0.0 0.0 3:46.33 grafana 57
46488 root 20 0 4700380 66800 48436 S 0.0 0.0 1:34.33 dockerd 54
2020 prometh+ 20 0 2214844 113796 55460 S 0.0 0.0 4:12.47 prometheus 53
2014 prometh+ 20 0 726728 21224 11968 S 0.0 0.0 14:22.54 node_exporter 52
2007 ipmi-ex+ 20 0 716884 17480 8512 S 0.0 0.0 0:41.42 ipmi_exporter 29
81266 huatai 20 0 2001676 10292 5252 S 0.0 0.0 0:02.28 apache2 27
81267 huatai 20 0 2001676 10644 5384 S 0.0 0.0 0:02.40 apache2 27
37897 root 20 0 2467684 50008 33528 S 0.0 0.0 1:43.43 containerd 24
57113 root 20 0 2018900 49400 33808 S 0.0 0.0 0:00.84 libvirtd 22
4012 grafana 20 0 718192 13124 10008 S 0.0 0.0 0:00.09 pcp_redis_datas 9
7300 libvirt+ 20 0 16.7g 16.0g 19668 S 7.3 4.2 131:08.50 qemu-system-x86 9
...
13 changes: 13 additions & 0 deletions source/kernel/process/utils/top/top_nth_output_too_large
@@ -0,0 +1,13 @@
top - 16:02:21 up 405 days, 5:56, 1 user, load average: 100.40, 94.44, 95.02
Tasks: 1654 total, 2 running, 1017 sleeping, 0 stopped, 19 zombie
%Cpu(s): 57.6 us, 15.1 sy, 0.0 ni, 27.3 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem : 79093344+total, 22562152 free, 34725468 used, 73364582+buff/cache
KiB Swap: 0 total, 0 free, 0 used. 52038676 avail Mem

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND nTH
378199 root 20 0 355.1g 6.5g 5.6g S 66.5 0.9 343573:12 rund-c8fe00be 14+
413302 root 20 0 375.5g 350.1g 350.0g S 2654 46.4 262097,00 rund-10930310 352
423338 root 20 0 10.7g 166248 53324 S 13.0 0.0 4787:05 containerd 236
198370 root 20 0 31.2g 7.3g 7.2g S 110.8 1.0 470711:52 rund-822bb8d1 205
405808 root 20 0 168512 45700 10036 S 0.0 0.0 813:53.55 node_agent_k8s 168
...
