smartmon
huataihuang committed Aug 23, 2023
1 parent d03b964 commit 2a5cba7
Showing 30 changed files with 706 additions and 6 deletions.
2 changes: 1 addition & 1 deletion source/clang/upgrade_developer_toolset_on_centos7.rst
@@ -31,7 +31,7 @@ make
.. literalinclude:: upgrade_developer_toolset_on_centos7/build_make
:caption: Upgrade make

- Configure :ref:`parallel_make` ( ``~/.bashrc`` ):
- Configure :ref:`parallel_make` ( ``~/.bash_profile`` ):

.. literalinclude:: parallel_make/make_j

1 change: 1 addition & 0 deletions source/linux/storage/disk/index.rst
@@ -17,6 +17,7 @@ Linux Disks
parted.rst
mount_img.rst
intel_ssd_dc_series.rst
update_intel_545s_ssd_firmware.rst
using_apple_superdrive_on_linux.rst
sandisk_cloudspeed_eco_gen_ii_sata_ssd.rst

191 changes: 191 additions & 0 deletions source/linux/storage/disk/smart_monitor.rst
@@ -9,11 +9,202 @@
.. literalinclude:: smart_monitor/dmesg_ssd_error
:caption: SSD disk error log in ``dmesg``

.. note::

I suspect that the firmware of this ``Intel 545s Series SSDs`` drive is at fault. According to `Latest Firmware For Solidigm™ (Formerly Intel®) Solid State Drives <https://www.solidigmtech.com.cn/support-page/product-doc-cert/ka-00099.html>`_ , the latest firmware for the ``Intel 545s Series SSDs`` is ``004C`` (512GB model) and ``0B3C`` (1TB model). I plan to upgrade the firmware to try to fix this reset problem.

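I want to use the drive's S.M.A.R.T. technology to detect and monitor disk anomalies:

- ``smartctl`` command-line checks, covered in this article (the basic capability)
- :ref:`node_exporter_smartctl_text_plugin` for visual monitoring through a self-hosted Prometheus + Grafana setup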

Installing ``smartmontools``
============================

- On :ref:`ubuntu_linux` , install it with :ref:`apt` :

.. literalinclude:: smart_monitor/apt_smartmontools
:caption: Install ``smartmontools`` on Ubuntu
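
A quick way to confirm that the installation worked, sketched under the assumption that the package places ``smartctl`` on the ``PATH`` ; ``have_cmd`` is a hypothetical helper name, not part of smartmontools:

```shell
# Succeed (exit status 0) when the named command can be found on PATH.
# have_cmd is a hypothetical helper used only for this illustration.
have_cmd() {
    command -v "$1" >/dev/null 2>&1
}

if have_cmd smartctl; then
    # Print only the first line of the version banner.
    smartctl --version | head -n 1
else
    echo "smartctl is not on PATH; install smartmontools first"
fi
```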

SMART info
=============

- Check whether the disk device supports SMART and has it enabled:

.. literalinclude:: smart_monitor/smartctl_info
:caption: Check disk information with ``smartctl -i``

The SMART information for my :ref:`sandisk_cloudspeed_eco_gen_ii_sata_ssd` is as follows:

.. literalinclude:: smart_monitor/smartctl_info_sandisk
:caption: ``smartctl -i`` information for the Sandisk SSD

.. literalinclude:: smart_monitor/smartctl_info_intel
:caption: ``smartctl -i`` information for the Intel SSD
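
The last line of the ``smartctl -i`` output is the one that matters for the check. A minimal sketch of testing for it in a script follows; ``smart_is_enabled`` is a hypothetical helper name, and the sample text is limited to the two ``SMART support is:`` lines shown in the output files of this commit:

```shell
# Succeed (exit status 0) when `smartctl -i` output on stdin reports SMART as enabled.
# smart_is_enabled is a hypothetical helper, not part of smartmontools.
smart_is_enabled() {
    grep -Eq '^SMART support is:[[:space:]]+Enabled'
}

# Sample lines copied from the `smartctl -i` output in this commit.
sample='SMART support is: Available - device has SMART capability.
SMART support is: Enabled'

if printf '%s\n' "$sample" | smart_is_enabled; then
    echo "SMART enabled"
else
    echo "SMART not enabled"
fi
```

On a real system the helper would be fed from the tool itself, for example ``sudo smartctl -i /dev/sda | smart_is_enabled`` . If SMART is available but disabled, ``smartctl -s on /dev/sda`` turns it on.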

SMART test
============

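SMART offers **two** test modes:

- Background Mode: the test runs at low priority, so the drive keeps servicing normal commands. When the drive is busy, the test pauses or continues at a reduced rate, so normal drive operation is not interrupted
- Foreground Mode: while the test runs, the drive answers incoming commands with a ``CHECK CONDITION`` status, so this mode is only suitable for drives that are not in use.

As a rule of thumb, **Background Mode is recommended**

Tests common to ATA and SCSI
----------------------------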

Short Test
~~~~~~~~~~~~

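The ``short test`` aims to identify a defective drive quickly, so it runs for at most about 2 minutes. It checks the disk in 3 stages:

- **Electrical Properties** : the controller tests its own electronics. Because this test is vendor-specific, it is not possible to say exactly what is tested; examples include the internal RAM, the read/write circuitry, or the head electronics
- **Mechanical Properties** : tests the servo system and the positioning mechanism; the exact sequence is also vendor-specific
- **Read/Verify** : reads a region of the disk and verifies certain data; the size and location of the region are also vendor-specific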

Long Test
~~~~~~~~~~~~~~

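The ``long test`` is designed as the final test in production. It is the same as the short test, with **2 differences** :

- The long test has no time limit
- The long test performs **Read/Verify** over the entire disk rather than a small region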
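
The two modes and the two test types above map onto ``smartctl`` options: ``-t short`` or ``-t long`` selects the test, and adding ``-C`` requests foreground (captive) mode. The sketch below only prints the command instead of running it; ``smart_test_cmd`` is a hypothetical helper name and ``/dev/sdb`` is just the device used later in this article:

```shell
# Print (without running) the smartctl command for a self-test.
# Usage: smart_test_cmd short|long background|foreground DEVICE
# smart_test_cmd is a hypothetical helper used only for illustration.
smart_test_cmd() (
    kind=$1 mode=$2 dev=$3
    case $kind in
        short|long) ;;
        *) echo "unknown test type: $kind" >&2; exit 2 ;;
    esac
    case $mode in
        background) echo "smartctl -t $kind $dev" ;;
        foreground) echo "smartctl -t $kind -C $dev" ;;   # -C = captive mode
        *) echo "unknown mode: $mode" >&2; exit 2 ;;
    esac
)

smart_test_cmd short background /dev/sdb   # prints: smartctl -t short /dev/sdb
smart_test_cmd long foreground /dev/sdb    # prints: smartctl -t long -C /dev/sdb
```

Printing the command first makes it easy to review before running it with ``sudo`` , which matters for foreground mode because the drive is busy until the test ends.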

ATA-specific tests
-------------------

Conveyance Tests
~~~~~~~~~~~~~~~~~~~~~~~~~~~

A conveyance test can determine within a few minutes whether a drive was damaged in transit

Selective Tests
~~~~~~~~~~~~~~~~~~~~~~~

A selective test takes an LBA range and scans only that LBA region:

.. literalinclude:: smart_monitor/smartctl_select_tests
:language: bash
:caption: Scan a specified LBA range

Multiple ranges (up to 5) can also be scanned:

.. literalinclude:: smart_monitor/smartctl_select_tests_multi
:language: bash
:caption: Scan multiple LBA ranges
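
The included command files are not part of this diff view, so the following is only a sketch of the general shape. ``smartctl`` accepts ``-t select,FIRST-LAST`` and the option may be repeated for several ranges; the helper name ``smart_select_cmd`` and the LBA numbers are made up for illustration:

```shell
# Print (without running) a selective self-test command for 1 to 5 LBA ranges.
# Usage: smart_select_cmd DEVICE FIRST-LAST [FIRST-LAST ...]
# smart_select_cmd is a hypothetical helper; the ranges below are illustrative.
smart_select_cmd() (
    dev=$1
    shift
    if [ "$#" -lt 1 ] || [ "$#" -gt 5 ]; then
        echo "need between 1 and 5 LBA ranges, got $#" >&2
        exit 2
    fi
    cmd="smartctl"
    for range in "$@"; do
        # Each range becomes its own -t select,FIRST-LAST option.
        cmd="$cmd -t select,$range"
    done
    echo "$cmd $dev"
)

smart_select_cmd /dev/sda 0-999 5000-5999
# prints: smartctl -t select,0-999 -t select,5000-5999 /dev/sda
```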

Testing with ``smartctl``
=========================

Checking a device's SMART capabilities
----------------------------------------

- Before testing, check the estimated duration of each test:

.. literalinclude:: smart_monitor/smartctl_capabilities_sda
:caption: ``smartctl`` device capabilities, including the estimated test times

Estimated test times for ``/dev/sda`` ( :ref:`sandisk_cloudspeed_eco_gen_ii_sata_ssd` ):

.. literalinclude:: smart_monitor/smartctl_capabilities_sda_output
:caption: ``smartctl`` capabilities of the :ref:`sandisk_cloudspeed_eco_gen_ii_sata_ssd` device, including the estimated test times
:emphasize-lines: 25-28

My other disk, ``/dev/sdb`` (Intel 545s series):

.. literalinclude:: smart_monitor/smartctl_capabilities_sdb_output
:caption: ``smartctl`` capabilities of the Intel 545s series SSD, including the estimated test times
:emphasize-lines: 25-28
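
The highlighted lines are the ``recommended polling time`` entries. A minimal sketch of pulling them out of ``smartctl -c`` output with ``awk`` follows; ``polling_times`` is a hypothetical helper name, and the sample input is copied from the Intel SSD output in this commit:

```shell
# Read `smartctl -c` output on stdin and print the recommended polling
# times in minutes, e.g. "short=2 extended=30".
# polling_times is a hypothetical helper, not part of smartmontools.
polling_times() {
    awk '
        /^Short self-test routine/    { kind = "short" }
        /^Extended self-test routine/ { kind = "extended" }
        /recommended polling time:/ && kind != "" {
            # The minutes are the number inside the parentheses.
            line = $0
            sub(/^[^(]*\(/, "", line)
            sub(/\).*$/, "", line)
            gsub(/[ \t]/, "", line)
            t[kind] = line
            kind = ""
        }
        END { printf "short=%s extended=%s\n", t["short"], t["extended"] }
    '
}

polling_times <<'EOF'
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 30) minutes.
EOF
# prints: short=2 extended=30
```

On a live system the input would come from ``sudo smartctl -c /dev/sdb | polling_times`` .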

Running the tests
-------------------

``/dev/sda``
~~~~~~~~~~~~~~

- Run the test (long test):

.. literalinclude:: smart_monitor/smartctl_long_tests_sda
:caption: ``smartctl`` long test on sda; note that the added ``-C`` option selects Foreground Mode

- Long test output:

.. literalinclude:: smart_monitor/smartctl_long_tests_sda_output
:caption: Output of the ``smartctl`` long test on sda
:emphasize-lines: 8,9

This :ref:`sandisk_cloudspeed_eco_gen_ii_sata_ssd` reports that it needs only 1 minute to finish the long test. That is the same as its short test time, which makes me doubt that it is a genuine full scan.

- View the test results ( ``-a`` option):

.. literalinclude:: smart_monitor/smartctl_view_sda_test_result
:caption: View the sda test results with ``smartctl``

.. literalinclude:: smart_monitor/smartctl_view_sda_test_result_output
:caption: ``smartctl`` test results for sda, showing a drive health (remaining life) of ``92%``
:emphasize-lines: 65,78

Here ``SSD_LifeLeft(0.01%)`` is expressed in units of ``0.01%`` (one ten-thousandth). Its value of ``9126`` converts to ``91.26%`` , which is why ``Drive_Life_Remaining%`` shows ``92`` (apparently rounded up)
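
The conversion can be written out with integer arithmetic. Both helper names below are hypothetical, and the rounding-up rule is only an assumption that happens to reproduce the ``92`` reported by the drive; the vendor's actual rule is not documented here:

```shell
# Convert a raw SSD_LifeLeft(0.01%) value (units of 0.01%) into a percentage string.
# life_left_percent and life_left_rounded_up are hypothetical helper names.
life_left_percent() (
    raw=$1
    printf '%d.%02d\n' "$((raw / 100))" "$((raw % 100))"
)

# Assumption: Drive_Life_Remaining% is the same value rounded up to a whole percent.
life_left_rounded_up() (
    raw=$1
    echo "$(( (raw + 99) / 100 ))"
)

life_left_percent 9126      # prints: 91.26
life_left_rounded_up 9126   # prints: 92
```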

``/dev/sdb``
~~~~~~~~~~~~~~

- Run the test (long test):

.. literalinclude:: smart_monitor/smartctl_long_tests_sdb
:caption: ``smartctl`` long test on sdb; note that the added ``-C`` option selects Foreground Mode

- Long test output:

.. literalinclude:: smart_monitor/smartctl_long_tests_sdb_output
:caption: Output of the ``smartctl`` long test on sdb
:emphasize-lines: 8,9

The Intel SSD's long test **appears to be a real test** : it takes 30 minutes to complete

- View the test results ( ``-a`` option):

.. literalinclude:: smart_monitor/smartctl_view_sdb_test_result
:caption: View the sdb (Intel SSD) test results with ``smartctl``

.. literalinclude:: smart_monitor/smartctl_view_sdb_test_result_output
:caption: ``smartctl`` test results for sdb; neither of the two runs completed ``Extended captive`` : ``Interrupted (host reset)``
:emphasize-lines: 87,88

Oddly, the SMART output of this Intel SSD shows no health value (remaining life, ``ID #245`` ), and the test did not complete, ending as ``Interrupted (host reset)`` . Two consecutive runs gave the same result (see the highlighted lines)

My guess is that, because ``/dev/sdb`` is in use (mounted as the system disk), the ``Foreground Test`` gets interrupted by disk read/write operations.

- Switch to a ``Background Mode`` ``long test`` (drop the ``-C`` option):

.. literalinclude:: smart_monitor/smartctl_long_tests_sdb_background_mode
:caption: ``smartctl`` long test on sdb; note that ``-C`` is **not used** , which selects ``Background Mode``

This time the command returns to the shell prompt immediately (with ``-C`` it blocks for a while):

.. literalinclude:: smart_monitor/smartctl_long_tests_sdb_output_background_mode
:caption: Output of the ``smartctl`` long test on sdb ( ``Background Mode`` )
:emphasize-lines: 2

The estimated test time is still 30 minutes, but the message now says ``off-line mode`` (with ``-C`` it said ``captive mode`` )
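
Because a background test returns immediately, progress has to be checked separately. One option, sketched here under assumptions, is to read the ``Self-test execution status`` code from ``smartctl -c`` : the capability output in this commit shows ``( 0)`` for "completed without error or no self-test has ever been run", and any other value is treated below simply as "not finished cleanly". ``selftest_exec_status`` is a hypothetical helper name:

```shell
# Print the numeric "Self-test execution status" code from `smartctl -c` output on stdin.
# selftest_exec_status is a hypothetical helper, not part of smartmontools.
selftest_exec_status() {
    awk '/^Self-test execution status:/ {
        # The code is the number inside the first pair of parentheses.
        line = $0
        sub(/^[^(]*\(/, "", line)
        sub(/\).*$/, "", line)
        gsub(/[ \t]/, "", line)
        print line
        exit
    }'
}

# Sample line copied from the capability output in this commit.
code=$(printf '%s\n' 'Self-test execution status: ( 0) The previous self-test routine completed' \
    | selftest_exec_status)
if [ "$code" = "0" ]; then
    echo "no self-test running; last one completed without error (or none was run)"
else
    echo "self-test status code: $code"
fi
```

A running test can be cancelled with ``smartctl -X /dev/sdb`` if it has to be stopped.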

- Sure enough, with the ``offline mode`` scan the test completes normally. The output is as follows:

.. literalinclude:: smart_monitor/smartctl_long_tests_sdb_background_mode_result_output
:caption: The ``smartctl`` long test on sdb ( ``Background Mode`` ) completes normally; result output
:emphasize-lines: 88
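
The difference between the failed and the successful run is the status column of the newest self-test log entry. A minimal sketch of checking it in a script follows; ``latest_selftest_ok`` is a hypothetical helper name, and the sample lines are trimmed to the entry number, description and status columns, using only the status strings quoted in this article:

```shell
# Read a SMART self-test log (from `smartctl -l selftest` or `smartctl -a`) on stdin
# and succeed only when the most recent entry ("# 1") completed without error.
# latest_selftest_ok is a hypothetical helper, not part of smartmontools.
latest_selftest_ok() {
    awk '
        /^# *1 / { found = 1; ok = ($0 ~ /Completed without error/); exit }
        END { if (found && ok) exit 0; exit 1 }
    '
}

# Foreground run from this article: interrupted, so the check fails.
printf '%s\n' '# 1  Extended captive    Interrupted (host reset)' | latest_selftest_ok \
    || echo "latest self-test did not complete cleanly"

# Background run: completed, so the check succeeds.
printf '%s\n' '# 1  Extended offline    Completed without error' | latest_selftest_ok \
    && echo "latest self-test completed without error"
```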

Here the ``LifeTime(hours)`` value is ``24193`` , which is the ``Power_On_Hours`` value, that is, how long the disk has been powered on
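
To put that figure in perspective, the hours can be converted to days with integer arithmetic; ``hours_to_days`` is a hypothetical helper name used only for this illustration:

```shell
# Convert a Power_On_Hours value into whole days plus remaining hours.
# hours_to_days is a hypothetical helper name.
hours_to_days() (
    hours=$1
    echo "$((hours / 24)) days $((hours % 24)) hours"
)

hours_to_days 24193   # prints: 1008 days 1 hours
```

That is roughly two and three quarter years of powered-on time.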

Strangely, ``Drive_Life_Remaining%`` cannot be viewed on the Intel SSD. Why?

After some searching, it appears that Intel has its own diagnostic tool: `How to Perform Quick/Full Diagnostic of Intel® SSDs Using Intel® Memory and Storage Tool (Intel® MAS) GUI <https://www.intel.com/content/www/us/en/support/articles/000056729/memory-and-storage/ssd-management-tools.html>`_ (this is the diagnostic tool for Intel Optane SSDs / Memory devices)

For details, see `Support for Intel® Memory and Storage Tool <https://www.intel.com/content/www/us/en/support/products/202249/memory-and-storage/ssd-management-tools/intel-memory-and-storage-tool.html>`_


References
==========

1 change: 1 addition & 0 deletions source/linux/storage/disk/smart_monitor/apt_smartmontools
@@ -0,0 +1 @@
sudo apt install smartmontools
@@ -0,0 +1 @@
sudo smartctl -c /dev/sda
@@ -0,0 +1,32 @@
=== START OF READ SMART DATA SECTION ===
General SMART Values:
Offline data collection status: (0x00) Offline data collection activity
was never started.
Auto Offline Data Collection: Disabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: (20160) seconds.
Offline data collection
capabilities: (0x5d) SMART execute Offline immediate.
No Auto Offline data collection support.
Abort Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
No Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 1) minutes.
Extended self-test routine
recommended polling time: ( 1) minutes.
SCT capabilities: (0x003d) SCT Status supported.
SCT Error Recovery Control supported.
SCT Feature Control supported.
SCT Data Table supported.
@@ -0,0 +1,32 @@
=== START OF READ SMART DATA SECTION ===
General SMART Values:
Offline data collection status: (0x00) Offline data collection activity
was never started.
Auto Offline Data Collection: Disabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: ( 0) seconds.
Offline data collection
capabilities: (0x53) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
No Offline surface scan supported.
Self-test supported.
No Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 30) minutes.
SCT capabilities: (0x003d) SCT Status supported.
SCT Error Recovery Control supported.
SCT Feature Control supported.
SCT Data Table supported.
1 change: 1 addition & 0 deletions source/linux/storage/disk/smart_monitor/smartctl_info
@@ -0,0 +1 @@
sudo smartctl -i /dev/sda
17 changes: 17 additions & 0 deletions source/linux/storage/disk/smart_monitor/smartctl_info_intel
@@ -0,0 +1,17 @@
=== START OF INFORMATION SECTION ===
Model Family: Intel 545s Series SSDs
Device Model: INTEL SSDSC2KW512G8
Serial Number: BTLA7513037S512DGN
LU WWN Device Id: 5 5cd2e4 14eea7536
Firmware Version: LHF002C
User Capacity: 512,110,190,592 bytes [512 GB]
Sector Size: 512 bytes logical/physical
Rotation Rate: Solid State Device
Form Factor: 2.5 inches
TRIM Command: Available, deterministic, zeroed
Device is: In smartctl database [for details use: -P show]
ATA Version is: ACS-3 (minor revision not indicated)
SATA Version is: SATA 3.2, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Wed Aug 23 11:42:31 2023 CST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
16 changes: 16 additions & 0 deletions source/linux/storage/disk/smart_monitor/smartctl_info_sandisk
@@ -0,0 +1,16 @@
=== START OF INFORMATION SECTION ===
Model Family: Sandisk SATA Cloudspeed Max and GEN2 ESS SSDs
Device Model: SDLF1CRR-019T-1HA1
Serial Number: A007C9D9
LU WWN Device Id: 5 001173 100a88424
Firmware Version: ZR11RPA1
User Capacity: 1,920,383,410,176 bytes [1.92 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Rotation Rate: Solid State Device
TRIM Command: Available, deterministic, zeroed
Device is: In smartctl database [for details use: -P show]
ATA Version is: ATA8-ACS T13/1699-D revision 4c
SATA Version is: SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Wed Aug 23 11:43:03 2023 CST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
@@ -0,0 +1 @@
sudo smartctl -t long -C /dev/sda
@@ -0,0 +1,9 @@
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.15.0-78-generic] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Execute SMART Extended self-test routine immediately in captive mode".
Drive command "Execute SMART Extended self-test routine immediately in captive mode" successful.
Testing has begun.
Please wait 1 minutes for test to complete.
Test will complete after Wed Aug 23 15:05:27 2023 CST
@@ -0,0 +1 @@
sudo smartctl -t long -C /dev/sdb
@@ -0,0 +1 @@
sudo smartctl -t long /dev/sdb
