SMART disk monitor
huataihuang committed Aug 17, 2023
1 parent b99f996 commit 782d595
Showing 16 changed files with 207 additions and 7 deletions.
37 changes: 37 additions & 0 deletions source/flask/flask_startup.rst
Original file line number Diff line number Diff line change
Expand Up @@ -93,6 +93,43 @@ Web应用会使用一些有意义的URLs让用户访问以及调用不同的函
   :language: python
   :caption: Return different types of data depending on the path the user visits

URL Building
================

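To build a URL to a specific function, use the ``url_for()`` function. It accepts the function name as its first argument and any number of keyword arguments, each corresponding to a variable part of the URL rule. Unknown variable parts are appended to the URL as query parameters:

- Build URLs with the URL-reversing function ``url_for()`` instead of hard-coding them into templates:

  - Reversing is usually more descriptive than hard-coding the URLs
  - URL building handles the escaping of special characters transparently
  - The generated paths are always absolute, which avoids unexpected behavior of relative paths in browsers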

.. literalinclude:: flask_startup/url_for.py
   :language: python
   :caption: Build URLs dynamically

This yields the following paths::

   /
   /login
   /login?next=/
   /user/John%20Doe
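
To make the mechanics concrete, here is a minimal standard-library sketch of the kind of substitution ``url_for()`` performs. It is not Flask's implementation: ``build_url`` and its placeholder regex are hypothetical names invented for this illustration, and it only handles plain ``<name>`` placeholders without converters.

```python
import re
from urllib.parse import quote, urlencode


def build_url(rule, **values):
    """Fill <name> placeholders in rule; append leftover values as a query string."""
    remaining = dict(values)

    def fill(match):
        # Percent-encode the value so spaces and other special characters are safe.
        return quote(str(remaining.pop(match.group(1))), safe='')

    path = re.sub(r'<(\w+)>', fill, rule)
    if remaining:
        # Values that match no placeholder become query parameters.
        path += '?' + urlencode(remaining, safe='/')
    return path


print(build_url('/'))
print(build_url('/login'))
print(build_url('/login', next='/'))
print(build_url('/user/<username>', username='John Doe'))
```

Under these assumptions the four calls print the same four paths as the listing above.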

HTTP Methods
=================

The same URL can provide different functionality depending on the HTTP method. For example, ``login`` usually distinguishes ``GET`` from ``POST``:

.. literalinclude:: flask_startup/http_methods.py
   :language: python
   :caption: Distinguish ``GET`` from ``POST``

In addition, Flask provides the ``get()`` and ``post()`` route shortcuts for the commonly used HTTP methods:

.. literalinclude:: flask_startup/http_methods_shortcut.py
   :language: python
   :caption: Distinguish ``GET`` from ``POST`` with the shortcuts
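
The relationship between ``route(methods=[...])`` and the ``get()`` / ``post()`` shortcuts can be sketched without Flask. The ``MiniApp`` class below is a hypothetical toy router written only for illustration; it is not how Flask is implemented, but it shows the shortcuts as thin wrappers that register a handler for a single method.

```python
class MiniApp:
    """Toy router: maps (method, path) pairs to handler functions."""

    def __init__(self):
        self.handlers = {}

    def route(self, path, methods=('GET',)):
        def decorator(func):
            # Register the same handler once per allowed method.
            for method in methods:
                self.handlers[(method, path)] = func
            return func
        return decorator

    def get(self, path):
        # Shortcut: identical to route(path, methods=('GET',)).
        return self.route(path, methods=('GET',))

    def post(self, path):
        # Shortcut: identical to route(path, methods=('POST',)).
        return self.route(path, methods=('POST',))

    def dispatch(self, method, path):
        handler = self.handlers.get((method, path))
        if handler is None:
            return '405 Method Not Allowed'
        return handler()


app = MiniApp()


@app.get('/login')
def login_get():
    return 'login form'


@app.post('/login')
def login_post():
    return 'logged in'


print(app.dispatch('GET', '/login'))
print(app.dispatch('POST', '/login'))
print(app.dispatch('DELETE', '/login'))
```

Splitting the handlers per method keeps each function free of ``if request.method == ...`` branching, which is the main appeal of the shortcut style.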

References
==========

8 changes: 8 additions & 0 deletions source/flask/flask_startup/http_methods.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,8 @@
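from flask import Flask, request

app = Flask(__name__)

# do_the_login() and show_the_login_form() are placeholders for your own view logic.
@app.route('/login', methods=['GET', 'POST'])
def login():
    if request.method == 'POST':
        return do_the_login()
    else:
        return show_the_login_form()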
9 changes: 9 additions & 0 deletions source/flask/flask_startup/http_methods_shortcut.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,9 @@
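from flask import Flask

app = Flask(__name__)

# show_the_login_form() and do_the_login() are placeholders for your own view logic.
@app.get('/login')
def login_get():
    return show_the_login_form()

@app.post('/login')
def login_post():
    return do_the_login()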
19 changes: 19 additions & 0 deletions source/flask/flask_startup/url_for.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,19 @@
from flask import Flask, url_for

app = Flask(__name__)

@app.route('/')
def index():
    return 'index'

@app.route('/login')
def login():
    return 'login'

@app.route('/user/<username>')
def profile(username):
    return f'{username}\'s profile'

# test_request_context() lets url_for() run outside a real request
with app.test_request_context():
    print(url_for('index'))
    print(url_for('login'))
    print(url_for('login', next='/'))
    print(url_for('profile', username='John Doe'))
Original file line number Diff line number Diff line change
Expand Up @@ -9,10 +9,10 @@ Node Exporter ipmitool Text Plugin
Preparation
===========

- Create a ``/var/lib/node_exporter/textfile_collector/`` directory to hold the ``*.prom`` files read through ``--collector.textfile.directory`` , so that they can be converted into metrics::
- Create a ``/var/lib/node_exporter/textfile_collector/`` directory to hold the ``*.prom`` files read through ``--collector.textfile.directory`` , so that they can be converted into metrics:

    sudo mkdir -p /var/lib/node_exporter/textfile_collector
    sudo chmod 777 /var/lib/node_exporter/textfile_collector
  .. literalinclude:: node_exporter_textfile-collector/textfile_collector_dir
     :caption: Prepare the ``/var/lib/node_exporter/textfile_collector/`` directory

- The Prometheus community provides `node-exporter-textfile-collector-scripts <https://github.com/prometheus-community/node-exporter-textfile-collector-scripts>`_ ; download these scripts to the server:

Expand Down Expand Up @@ -60,8 +60,17 @@ Node Exporter ipmitool Text Plugin

Restart the ``node_exporter`` service

Configure the Grafana Dashboard
===============================

In :ref:`grafana` , ``import`` `Grafana Dashboard 13177: IPMI for Prometheus <https://grafana.com/grafana/dashboards/13177-ipmi-for-prometheus/>`_

The finished dashboard:

.. figure:: ../../../../_static/kubernetes/monitor/prometheus/prometheus_exporters/node_exporter_with_ipmitool_text_plugin.png

In :ref:`grafana` , ``import`` `Grafana Dashboard 15765: IPMI Exporter <https://grafana.com/grafana/dashboards/15765-ipmi-exporter/>`_ (this dashboard looks clearer, but I prefer the ``Time series`` visualization for temperatures, so I added a panel)

The finished dashboard:

.. figure:: ../../../../_static/kubernetes/monitor/prometheus/prometheus_exporters/node_exporter_with_ipmitool_text_plugin_1.png
Original file line number Diff line number Diff line change
Expand Up @@ -4,11 +4,70 @@
Node Exporter smartctl Text Plugin
===================================

Monitoring disk SMART data also relies on :ref:`node_exporter_textfile-collector`
Monitoring disk SMART data also relies on :ref:`node_exporter_textfile-collector` . The Prometheus community's `node-exporter-textfile-collector-scripts <https://github.com/prometheus-community/node-exporter-textfile-collector-scripts>`_ includes ``smartmon.sh`` and ``smartmon.py`` , which output data in the text format that Prometheus collects
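
Before pointing ``node_exporter`` at a generated file, it can help to sanity-check that every sample line has the shape ``name{labels} value``. The sketch below is a simplified checker, not a full parser of the Prometheus text format: ``check_prom_text`` is a hypothetical helper, the sample lines are invented for illustration rather than copied from real ``smartmon.sh`` output, and optional timestamps or ``}`` characters inside label values are not handled.

```python
import re

# metric name, optional {labels}, whitespace, value
LINE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(\{[^}]*\})?\s+(\S+)$')


def check_prom_text(text):
    """Return (line_number, line) pairs that do not look like valid samples."""
    bad = []
    for number, line in enumerate(text.splitlines(), start=1):
        line = line.strip()
        if not line or line.startswith('#'):
            continue  # blank lines and HELP/TYPE comments are fine
        match = LINE_RE.match(line)
        if match is None:
            bad.append((number, line))
            continue
        try:
            float(match.group(3))  # the sample value must be numeric
        except ValueError:
            bad.append((number, line))
    return bad


# Invented sample content; real smartmon output will differ.
SAMPLE = '''\
# HELP smartmon_device_smart_healthy SMART metric device_smart_healthy
# TYPE smartmon_device_smart_healthy gauge
smartmon_device_smart_healthy{disk="/dev/sdb",type="sat"} 1
this line is broken
'''

print(check_prom_text(SAMPLE))
```

An empty list means every non-comment line passed this shallow check.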

Preparation
===========

.. note::

   I already completed this preparation in :ref:`node_exporter_ipmitool_text_plugin`

- Create a ``/var/lib/node_exporter/textfile_collector/`` directory to hold the ``*.prom`` files read through ``--collector.textfile.directory`` , so that they can be converted into metrics:

  .. literalinclude:: node_exporter_textfile-collector/textfile_collector_dir
     :caption: Prepare the ``/var/lib/node_exporter/textfile_collector/`` directory

- The Prometheus community provides `node-exporter-textfile-collector-scripts <https://github.com/prometheus-community/node-exporter-textfile-collector-scripts>`_ ; download these scripts to the server:

  .. literalinclude:: node_exporter_textfile-collector/git_node-exporter-textfile-collector-scripts
     :caption: Download ``node-exporter-textfile-collector-scripts`` locally (to ``/etc/prometheus`` )

Run the Script
==============

- Either community script, ``smartmon.py`` or ``smartmon.sh`` , can produce the output. Note that it needs root privileges through ``sudo`` ::

    sudo /etc/prometheus/node-exporter-textfile-collector-scripts/smartmon.sh | sponge /var/lib/node_exporter/textfile_collector/smartmon.prom

- After verifying that the contents of ``/var/lib/node_exporter/textfile_collector/smartmon.prom`` are correct, configure crontab (for the root user, since the script needs root privileges)::

    crontab -e

  Enter the following::

    * * * * * /etc/prometheus/node-exporter-textfile-collector-scripts/smartmon.sh | sponge /var/lib/node_exporter/textfile_collector/smartmon.prom
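
The pipeline above uses ``sponge`` so that the ``.prom`` file is only written once the script has produced all of its output. A similar effect can be sketched in Python by writing to a temporary file in the same directory and renaming it into place. ``write_prom_atomically`` is a hypothetical helper for illustration, not part of the community scripts, and the metric line in the demo is invented.

```python
import os
import tempfile


def write_prom_atomically(text, target_path):
    """Write text to target_path so readers never see a half-written file."""
    directory = os.path.dirname(target_path) or '.'
    # The temp file lives in the same directory so the rename stays on one filesystem.
    # Its name does not end in .prom, so the textfile collector will not pick it up.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix='.tmp')
    try:
        with os.fdopen(fd, 'w') as handle:
            handle.write(text)
        os.chmod(tmp_path, 0o644)  # mkstemp creates 0600; make it readable by node_exporter
        os.replace(tmp_path, target_path)  # atomic rename on POSIX
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise


# Demo in a throwaway directory instead of /var/lib/node_exporter/textfile_collector
with tempfile.TemporaryDirectory() as demo_dir:
    demo_file = os.path.join(demo_dir, 'smartmon.prom')
    write_prom_atomically('example_metric 1\n', demo_file)
    with open(demo_file) as handle:
        print(handle.read(), end='')
```

The rename step is what guarantees a scrape sees either the old file or the new one, never a partial write.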

Configure ``node_exporter``
===========================

.. note::

   I already completed this preparation in :ref:`node_exporter_ipmitool_text_plugin`

Following the :ref:`systemd` service configuration in :ref:`node_exporter` , revise ``/etc/systemd/system/node_exporter.service`` ::

   ExecStart=/usr/local/bin/node_exporter \
       --collector.textfile.directory=/var/lib/node_exporter/textfile_collector

Restart the ``node_exporter`` service

Configure the Grafana Dashboard
===============================

In :ref:`grafana` , ``import`` `Grafana Dashboard 16514: SMART + NVMe status <https://grafana.com/grafana/dashboards/16514-smart-nvme-status/>`_

Improved Version (Recommended)
====================================

- Use the revised `janw / node-exporter-textfile-collector-scripts / smartmon.sh <https://github.com/janw/node-exporter-textfile-collector-scripts/blob/master/smartmon.sh>`_

- `Grafana Dashboard 10664: SMART disk data <https://grafana.com/grafana/dashboards/10664-smart-disk-data/>`_ is also fairly clear
- `Grafana Dashboard 10664: SMART disk data <https://grafana.com/grafana/dashboards/10664-smart-disk-data/>`_ is a dashboard I strongly recommend; I found it better and more detailed than `Grafana Dashboard 16514: SMART + NVMe status <https://grafana.com/grafana/dashboards/16514-smart-nvme-status/>`_

.. figure:: ../../../../_static/kubernetes/monitor/prometheus/prometheus_exporters/node_exporter_with_smartmon_text_plugin.png

Other
=======

- Use `olegeech-me / S.M.A.R.T-disk-monitoring-for-Prometheus <https://github.com/olegeech-me/S.M.A.R.T-disk-monitoring-for-Prometheus/>`_ (forked from `micha37-martins / S.M.A.R.T-disk-monitoring-for-Prometheus <https://github.com/micha37-martins/S.M.A.R.T-disk-monitoring-for-Prometheus>`_ ):

Expand All @@ -18,3 +77,8 @@ Node Exporter smartctl Text Plugin

- `Grafana Dashboard 10530: S.M.A.R.T disk monitoring for Prometheus Dashboard <https://grafana.com/grafana/dashboards/10530-s-m-a-r-t-disk-monitoring-for-prometheus-dashboard/>`_ gives a good overview; I plan to use it
- `Grafana Dashboard 10531: S.M.A.R.T disk monitoring for Prometheus Errorboard <https://grafana.com/grafana/dashboards/10531-s-m-a-r-t-disk-monitoring-for-prometheus-errorboard/>`_ mainly adds error details

References
==========

- `Monitoring a mixed fleet of flash, HDD, and NVMe devices with node_exporter and Prometheus <https://www.wirewd.com/hacks/blog/monitoring_a_mixed_fleet_of_flash_hdd_and_nvme_devices_with_node_exporter_and_prometheus>`_
Original file line number Diff line number Diff line change
Expand Up @@ -25,6 +25,7 @@ Prometheus社区提供了 `node-exporter-textfile-collector-scripts <https://git
==========

- :ref:`node_exporter_ipmitool_text_plugin`
- :ref:`node_exporter_smartctl_text_plugin`

References
==========
Original file line number Diff line number Diff line change
@@ -0,0 +1,2 @@
sudo mkdir -p /var/lib/node_exporter/textfile_collector
sudo chmod 777 /var/lib/node_exporter/textfile_collector
Original file line number Diff line number Diff line change
Expand Up @@ -4,7 +4,7 @@
PromQL Query Basics
==============================

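Prometheus provides a functional query language called ``PromQL`` that lets users select and aggregate time series data in real time. The result of an expression can be shown as a graph or as a table, or consumed externally through the HTTP API.
Prometheus provides a functional query language called ``PromQL`` that lets users select and aggregate time series data in real time. The result of an expression can be shown as a graph or as a table, or consumed externally through the HTTP API.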
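
As a small illustration of calling PromQL from outside, the sketch below builds the URL for an instant query against the Prometheus HTTP API endpoint ``/api/v1/query``. The server address ``http://localhost:9090`` and the helper name ``instant_query_url`` are assumptions made for this example, and the code only constructs the URL without sending a request.

```python
from urllib.parse import urlencode


def instant_query_url(base_url, promql):
    """Return the instant-query URL for a PromQL expression."""
    # urlencode takes care of escaping spaces, braces and operators in the expression.
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


# Assumed local Prometheus address; "up" is the built-in target health metric.
print(instant_query_url('http://localhost:9090', 'up == 1'))
```

The resulting URL can be fetched with any HTTP client; the response is JSON with a ``status`` field and a ``data`` payload.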

Expression Language Data Types
==============================
Expand All @@ -22,3 +22,4 @@ Prometheus的表达式语言(Expression language)中,表达式 或 子表达
======

- `QUERYING PROMETHEUS: Basics <https://prometheus.io/docs/prometheus/latest/querying/basics/>`_
- `PromQL Tutorial: 5 Tricks to Become a Prometheus God <https://coralogix.com/blog/promql-tutorial-5-tricks-to-become-a-prometheus-god/>`_
4 changes: 4 additions & 0 deletions source/linux/server/cockpit/intro_cockpit.rst
Original file line number Diff line number Diff line change
Expand Up @@ -21,6 +21,10 @@ Cockpit是Linux服务器的系统管理平台,可以用于管理容器、存

Visit: https://ip-address-of-machine:9090

.. note::

   Because :ref:`prometheus` also uses port ``9090`` by default, I changed :ref:`cockpit_port_address` to ``9091``
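
A quick way to spot this kind of port conflict before starting a service is to test whether something already listens on the port. The following is a minimal sketch using only the standard library; ``port_in_use`` is a hypothetical helper, and the check covers only TCP on the given host (``127.0.0.1`` by default).

```python
import socket


def port_in_use(port, host='127.0.0.1'):
    """Return True if a TCP connection to host:port succeeds."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(1.0)
        # connect_ex returns 0 on success instead of raising an exception
        return sock.connect_ex((host, port)) == 0


for candidate in (9090, 9091):
    state = 'in use' if port_in_use(candidate) else 'free'
    print(f'port {candidate}: {state}')
```

The printed result depends on what is running on the machine, so no fixed output is shown here.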

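Many mainstream Linux distributions have built-in support for Cockpit (Arch Linux now ships cockpit as well, so it no longer has to be installed from a third-party community repository):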

.. figure:: ../../../_static/linux/server/cockpit/cockpit_support_linux.png
10 changes: 10 additions & 0 deletions source/linux/storage/disk/smart_monitor.rst
Original file line number Diff line number Diff line change
Expand Up @@ -4,6 +4,16 @@
Disk S.M.A.R.T Monitor
=======================

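My second-hand :ref:`hpe_dl360_gen9` server uses an Intel SATA SSD that I bought a long time ago, and from time to time this SSD leaves alarming error records in the system log:

.. literalinclude:: smart_monitor/dmesg_ssd_error
   :caption: SSD disk error log in ``dmesg``

I want to use the S.M.A.R.T. technology of storage devices to detect and monitor disk anomalies:

- ``smartctl`` command-line checks, covered in this article (the basic capability)
- :ref:`node_exporter_smartctl_text_plugin` for visual observation through my self-hosted Prometheus + Grafana monitoring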
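
Before any monitoring is in place, the kernel log itself can be summarized with a few lines of Python. The sketch below extracts the device and sector from ``I/O error`` lines like the ones in the ``dmesg`` excerpt; the helper name ``collect_io_errors`` and the regular expression are assumptions for illustration and only match this particular message format.

```python
import re
from collections import defaultdict

# Matches e.g. "I/O error, dev sdb, sector 9233640"
IO_ERROR_RE = re.compile(r'I/O error, dev (\w+), sector (\d+)')


def collect_io_errors(lines):
    """Map each device name to the list of sectors reported in I/O errors."""
    errors = defaultdict(list)
    for line in lines:
        match = IO_ERROR_RE.search(line)
        if match:
            errors[match.group(1)].append(int(match.group(2)))
    return dict(errors)


# Lines taken from the dmesg excerpt for this SSD
SAMPLE = [
    '[Sun Aug 6 11:05:54 2023] blk_update_request: I/O error, dev sdb, sector 9233640 op 0x0:(READ) flags 0x80700 phys_seg 1 prio class 0',
    '[Sun Aug 6 11:05:54 2023] ata5: EH complete',
    '[Sun Aug 6 11:06:24 2023] blk_update_request: I/O error, dev sdb, sector 13508464 op 0x0:(READ) flags 0x80700 phys_seg 1 prio class 0',
]

print(collect_io_errors(SAMPLE))
```

On a live system the same function could be fed the lines of ``dmesg`` output read from a file or a pipe.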

References
==========

35 changes: 35 additions & 0 deletions source/linux/storage/disk/smart_monitor/dmesg_ssd_error
Original file line number Diff line number Diff line change
@@ -0,0 +1,35 @@
[Sun Aug 6 11:05:54 2023] ata5.00: exception Emask 0x0 SAct 0x80080000 SErr 0x0 action 0x6 frozen
[Sun Aug 6 11:05:54 2023] ata5.00: failed command: READ FPDMA QUEUED
[Sun Aug 6 11:05:54 2023] ata5.00: cmd 60/08:98:98:20:9c/00:00:02:00:00/40 tag 19 ncq dma 4096 in
res 40/00:00:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
[Sun Aug 6 11:05:54 2023] ata5.00: status: { DRDY }
[Sun Aug 6 11:05:54 2023] ata5.00: failed command: READ FPDMA QUEUED
[Sun Aug 6 11:05:54 2023] ata5.00: cmd 60/08:f8:e8:e4:8c/00:00:00:00:00/40 tag 31 ncq dma 4096 in
res 40/00:01:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
[Sun Aug 6 11:05:54 2023] ata5.00: status: { DRDY }
[Sun Aug 6 11:05:54 2023] ata5: hard resetting link
[Sun Aug 6 11:05:54 2023] ata5: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[Sun Aug 6 11:05:54 2023] ata5.00: configured for UDMA/133
[Sun Aug 6 11:05:54 2023] sd 4:0:0:0: [sdb] tag#31 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=30s
[Sun Aug 6 11:05:54 2023] sd 4:0:0:0: [sdb] tag#31 Sense Key : Illegal Request [current]
[Sun Aug 6 11:05:54 2023] sd 4:0:0:0: [sdb] tag#31 Add. Sense: Unaligned write command
[Sun Aug 6 11:05:54 2023] sd 4:0:0:0: [sdb] tag#31 CDB: Read(10) 28 00 00 8c e4 e8 00 00 08 00
[Sun Aug 6 11:05:54 2023] blk_update_request: I/O error, dev sdb, sector 9233640 op 0x0:(READ) flags 0x80700 phys_seg 1 prio class 0
[Sun Aug 6 11:05:54 2023] ata5: EH complete
[Sun Aug 6 11:05:54 2023] ata5.00: Enabling discard_zeroes_data
[Sun Aug 6 11:06:24 2023] ata5.00: exception Emask 0x0 SAct 0x1000000 SErr 0x0 action 0x6 frozen
[Sun Aug 6 11:06:24 2023] ata5.00: failed command: READ FPDMA QUEUED
[Sun Aug 6 11:06:24 2023] ata5.00: cmd 60/08:c0:70:1f:ce/00:00:00:00:00/40 tag 24 ncq dma 4096 in
res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[Sun Aug 6 11:06:24 2023] ata5.00: status: { DRDY }
[Sun Aug 6 11:06:24 2023] ata5: hard resetting link
[Sun Aug 6 11:06:24 2023] ata5: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[Sun Aug 6 11:06:24 2023] ata5.00: configured for UDMA/133
[Sun Aug 6 11:06:24 2023] ata5.00: device reported invalid CHS sector 0
[Sun Aug 6 11:06:24 2023] sd 4:0:0:0: [sdb] tag#24 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=30s
[Sun Aug 6 11:06:24 2023] sd 4:0:0:0: [sdb] tag#24 Sense Key : Illegal Request [current]
[Sun Aug 6 11:06:24 2023] sd 4:0:0:0: [sdb] tag#24 Add. Sense: Unaligned write command
[Sun Aug 6 11:06:24 2023] sd 4:0:0:0: [sdb] tag#24 CDB: Read(10) 28 00 00 ce 1f 70 00 00 08 00
[Sun Aug 6 11:06:24 2023] blk_update_request: I/O error, dev sdb, sector 13508464 op 0x0:(READ) flags 0x80700 phys_seg 1 prio class 0
[Sun Aug 6 11:06:24 2023] ata5: EH complete
[Sun Aug 6 11:06:24 2023] ata5.00: Enabling discard_zeroes_data
Original file line number Diff line number Diff line change
Expand Up @@ -86,3 +86,4 @@ Thoughts on RAID

- `Red Hat Enterprise Linux 9 Docs > Managing storage devices > Chapter 18. Managing RAID <https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/9/html/managing_storage_devices/managing-raid_managing-storage-devices>`_
- `Red Hat Enterprise Linux 9 Docs > 管理存储设备 > 第18章 管理RAID <https://access.redhat.com/documentation/zh-cn/red_hat_enterprise_linux/9/html/managing_storage_devices/managing-raid_managing-storage-devices>`_ (Chinese edition)
- `archlinux: RAID <https://wiki.archlinux.org/title/RAID>`_ The :ref:`arch_linux` documentation is always thorough and comprehensive; recommended reading
2 changes: 1 addition & 1 deletion source/linux/storage/software_raid/mdadm_raid10.rst
Original file line number Diff line number Diff line change
Expand Up @@ -107,4 +107,4 @@ Building RAID10 with mdadm
- `Red Hat Enterprise Linux 9 Docs > 管理存储设备 > 第18章 管理RAID <https://access.redhat.com/documentation/zh-cn/red_hat_enterprise_linux/9/html/managing_storage_devices/managing-raid_managing-storage-devices>`_ (Chinese edition)
- `How to configure RAID6 in centos 7 <https://www.linuxhelp.com/how-to-configure-raid6-in-centos-7>`_
- `Create Software RAID 10 With mdadm <https://allcloud.io/blog/create-software-raid-10-with-mdadm/>`_
- `SUSE Linux Enterprise Server Documentation / Storage Administration Guide / Software RAID / Creating Software RAID 10 Devices <https://documentation.suse.com/sles/15-SP1/html/SLES-all/cha-raid10.html>`_
- `SUSE Linux Enterprise Server Documentation / Storage Administration Guide / Software RAID / Creating Software RAID 10 Devices <https://documentation.suse.com/sles/15-SP1/html/SLES-all/cha-raid10.html>`_ This SUSE document is very detailed, and some of its details deserve careful study (the official Software RAID documentation is much more detailed than Red Hat's)
