Skip to content

Commit

Permalink
resync NPD
Browse files Browse the repository at this point in the history
  • Loading branch information
Zhuzhenghao committed Jan 30, 2023
1 parent 3906c45 commit 929c83e
Showing 1 changed file with 126 additions and 101 deletions.
227 changes: 126 additions & 101 deletions content/zh-cn/docs/tasks/debug/debug-cluster/monitor-node-health.md
Original file line number Diff line number Diff line change
Expand Up @@ -13,23 +13,23 @@ weight: 20
-->

<!-- overview -->
<!--
<!--
*Node Problem Detector* is a daemon for monitoring and reporting about a node's health.
You can run Node Problem Detector as a `DaemonSet` or as a standalone daemon.
Node Problem Detector collects information about node problems from various daemons
and reports these conditions to the API server as [NodeCondition](/docs/concepts/architecture/nodes/#condition)
and [Event](/docs/reference/generated/kubernetes-api/{{< param "version" >}}/#event-v1-core).
and reports these conditions to the API server as Node [Condition](/docs/concepts/architecture/nodes/#condition)s
or as [Event](/docs/reference/kubernetes-api/cluster-resources/event-v1)s.
To learn how to install and use Node Problem Detector, see
[Node Problem Detector project documentation](https://github.com/kubernetes/node-problem-detector).
-->

*节点问题检测器(Node Problem Detector)* 是一个守护程序,用于监视和报告节点的健康状况。
你可以将节点问题探测器以 `DaemonSet` 或独立守护程序运行。
节点问题检测器从各种守护进程收集节点问题,并以
[NodeCondition](/zh-cn/docs/concepts/architecture/nodes/#condition)
[Event](/docs/reference/generated/kubernetes-api/{{< param "version" >}}/#event-v1-core)
的形式报告给 API 服务器。
节点问题检测器从各种守护进程收集节点问题,并以节点
[Condition](/zh-cn/docs/concepts/architecture/nodes/#condition)
[Event](/zh-cn/docs/reference/kubernetes-api/cluster-resources/event-v1)
的形式报告给 API 服务器。

要了解如何安装和使用节点问题检测器,请参阅
[节点问题探测器项目文档](https://github.com/kubernetes/node-problem-detector)
Expand All @@ -40,46 +40,41 @@ To learn how to install and use Node Problem Detector, see

<!-- steps -->

<!--
## Limitations
* Node Problem Detector only supports file based kernel log.
Log tools such as `journald` are not supported.
<!--
## Limitations
* Node Problem Detector uses the kernel log format for reporting kernel issues.
To learn how to extend the kernel log format, see [Add support for another log format](#support-other-log-format).
-->
## 局限性 {#limitations}

* 节点问题检测器只支持基于文件类型的内核日志。
它不支持像 journald 这样的命令行日志工具。
* 节点问题检测器使用内核日志格式来报告内核问题。
要了解如何扩展内核日志格式,请参阅[添加对另一个日志格式的支持](#support-other-log-format)

<!--
<!--
## Enabling Node Problem Detector
Some cloud providers enable Node Problem Detector as an {{< glossary_tooltip text="Addon" term_id="addons" >}}.
You can also enable Node Problem Detector with `kubectl` or by creating an Addon pod.
You can also enable Node Problem Detector with `kubectl` or by creating an Addon DaemonSet.
-->
## 启用节点问题检测器

一些云供应商将节点问题检测器以{{< glossary_tooltip text="插件" term_id="addons" >}}形式启用。
你还可以使用 `kubectl` 或创建插件 Pod 来启用节点问题探测器。
你还可以使用 `kubectl` 或创建插件 DaemonSet 来启用节点问题探测器。

<!--
## Using kubectl to enable Node Problem Detector {#using-kubectl}
<!--
### Using kubectl to enable Node Problem Detector {#using-kubectl}
`kubectl` provides the most flexible management of Node Problem Detector.
You can overwrite the default configuration to fit it into your environment or
to detect customized node problems. For example:
-->
## 使用 kubectl 启用节点问题检测器 {#using-kubectl}
### 使用 kubectl 启用节点问题检测器 {#using-kubectl}

`kubectl` 提供了节点问题探测器最灵活的管理。
你可以覆盖默认配置使其适合你的环境或检测自定义节点问题。例如:

<!--
<!--
1. Create a Node Problem Detector configuration similar to `node-problem-detector.yaml`:
{{< codenew file="debug/node-problem-detector.yaml" >}}
Expand Down Expand Up @@ -107,7 +102,7 @@ to detect customized node problems. For example:
kubectl apply -f https://k8s.io/examples/debug/node-problem-detector.yaml
```

<!--
<!--
### Using an Addon pod to enable Node Problem Detector {#using-addon-pod}

If you are using a custom cluster bootstrap solution and don't need
Expand All @@ -117,33 +112,33 @@ further automate the deployment.
Create `node-problem-detector.yaml`, and save the configuration in the Addon pod's
directory `/etc/kubernetes/addons/node-problem-detector` on a control plane node.
-->
### 使用插件 pod 启用节点问题检测器 {#using-addon-pod}
### 使用插件 Pod 启用节点问题检测器 {#using-addon-pod}

如果你使用的是自定义集群引导解决方案,不需要覆盖默认配置,
可以利用插件 Pod 进一步自动化部署。

创建 `node-strick-detector.yaml`,并在控制平面节点上保存配置到插件 Pod 的目录
`/etc/kubernetes/addons/node-problem-detector`

<!--
## Overwrite the Configuration
<!--
## Overwrite the configuration

The [default configuration](https://github.com/kubernetes/node-problem-detector/tree/v0.1/config)
The [default configuration](https://github.com/kubernetes/node-problem-detector/tree/v0.8.12/config)
is embedded when building the Docker image of Node Problem Detector.
-->
## 覆盖配置文件

构建节点问题检测器的 docker 镜像时,会嵌入
[默认配置](https://github.com/kubernetes/node-problem-detector/tree/v0.1/config)。
[默认配置](https://github.com/kubernetes/node-problem-detector/tree/v0.8.12/config)。

<!--
<!--
However, you can use a [`ConfigMap`](/docs/tasks/configure-pod-container/configure-pod-configmap/)
to overwrite the configuration:
-->
不过,你可以像下面这样使用 [`ConfigMap`](/zh-cn/docs/tasks/configure-pod-container/configure-pod-configmap/)
将其覆盖:

<!--
<!--
1. Change the configuration files in `config/`
1. Create the `ConfigMap` `node-problem-detector-config`:

Expand All @@ -165,24 +160,24 @@ to overwrite the configuration:
-->
1. 更改 `config/` 中的配置文件
1. 创建 `ConfigMap` `node-strick-detector-config`

```shell
kubectl create configmap node-problem-detector-config --from-file=config/
```

1. 更改 `node-problem-detector.yaml` 以使用 ConfigMap:

{{< codenew file="debug/node-problem-detector-configmap.yaml" >}}

1. 使用新的配置文件重新创建节点问题检测器:

```shell
```shell
# 如果你正在运行节点问题检测器,请先删除,然后再重新创建
kubectl delete -f https://k8s.io/examples/debug/node-problem-detector.yaml
kubectl apply -f https://k8s.io/examples/debug/node-problem-detector-configmap.yaml
```

<!--
<!--
{{< note >}}
This approach only applies to a Node Problem Detector started with `kubectl`.
{{< /note >}}
Expand All @@ -197,99 +192,129 @@ The Addon manager does not support `ConfigMap`.
如果节点问题检测器作为集群插件运行,则不支持覆盖配置。
插件管理器不支持 `ConfigMap`

<!--
## Kernel Monitor
<!--
## Problem Daemons

*Kernel Monitor* is a system log monitor daemon supported in the Node Problem Detector.
Kernel monitor watches the kernel log and detects known kernel issues following predefined rules.
A problem daemon is a sub-daemon of the Node Problem Detector. It monitors specific kinds of node
problems and reports them to the Node Problem Detector.
There are several types of supported problem daemons.
-->
## 内核监视器

*内核监视器(Kernel Monitor)* 是节点问题检测器中支持的系统日志监视器守护进程。
内核监视器观察内核日志并根据预定义规则检测已知的内核问题。
## 问题守护程序

<!--
The Kernel Monitor matches kernel issues according to a set of predefined rule list in
[`config/kernel-monitor.json`](https://github.com/kubernetes/node-problem-detector/blob/v0.1/config/kernel-monitor.json). The rule list is extensible. You can expand the rule list by overwriting the
configuration.
问题守护程序是节点问题检测器的子守护程序。
它监视特定类型的节点问题并报告给节点问题检测器。
支持下面几种类型的问题守护程序。

<!--
- A `SystemLogMonitor` type of daemon monitors the system logs and reports problems and metrics
according to predefined rules. You can customize the configurations for different log sources
such as [filelog](https://github.com/kubernetes/node-problem-detector/blob/v0.8.12/config/kernel-monitor-filelog.json),
[kmsg](https://github.com/kubernetes/node-problem-detector/blob/v0.8.12/config/kernel-monitor.json),
[kernel](https://github.com/kubernetes/node-problem-detector/blob/v0.8.12/config/kernel-monitor-counter.json),
[abrt](https://github.com/kubernetes/node-problem-detector/blob/v0.8.12/config/abrt-adaptor.json),
and [systemd](https://github.com/kubernetes/node-problem-detector/blob/v0.8.12/config/systemd-monitor-counter.json).
-->
- `SystemLogMonitor` 类型的守护程序根据预定义的规则监视系统日志并报告问题和指标。
你可以针对不同的日志源自定义配置如
[filelog](https://github.com/kubernetes/node-problem-detector/blob/v0.8.12/config/kernel-monitor-filelog.json)、
[kmsg](https://github.com/kubernetes/node-problem-detector/blob/v0.8.12/config/kernel-monitor.json)、
[kernel](https://github.com/kubernetes/node-problem-detector/blob/v0.8.12/config/kernel-monitor-counter.json)、
[abrt](https://github.com/kubernetes/node-problem-detector/blob/v0.8.12/config/abrt-adaptor.json)
和 [systemd](https://github.com/kubernetes/node-problem-detector/blob/v0.8.12/config/systemd-monitor-counter.json)。

<!--
- A `SystemStatsMonitor` type of daemon collects various health-related system stats as metrics.
You can customize its behavior by updating its
[configuration file](https://github.com/kubernetes/node-problem-detector/blob/v0.8.12/config/system-stats-monitor.json).
-->
内核监视器根据 [`config/kernel-monitor.json`](https://github.com/kubernetes/node-problem-detector/blob/v0.1/config/kernel-monitor.json)
中的一组预定义规则列表匹配内核问题。
规则列表是可扩展的,你始终可以通过覆盖配置来扩展它。

<!--
### Add new NodeConditions
- `SystemStatsMonitor` 类型的守护程序收集各种与健康相关的系统统计数据作为指标。
你可以通过更新其[配置文件](https://github.com/kubernetes/node-problem-detector/blob/v0.8.12/config/system-stats-monitor.json)来自定义其行为。

To support a new `NodeCondition`, create a condition definition within the `conditions` field in
`config/kernel-monitor.json`, for example:
```
<!--
- A `CustomPluginMonitor` type of daemon invokes and checks various node problems by running
user-defined scripts. You can use different custom plugin monitors to monitor different
problems and customize the daemon behavior by updating the
[configuration file](https://github.com/kubernetes/node-problem-detector/blob/v0.8.12/config/custom-plugin-monitor.json).
-->
### 添加新的 NodeCondition

要支持新的 `NodeCondition`,请在 `config/kernel-monitor.json` 中的
`conditions` 字段中创建一个条件定义:
- `CustomPluginMonitor` 类型的守护程序通过运行用户定义的脚本来调用和检查各种节点问题。
你可以使用不同的自定义插件监视器来监视不同的问题,并通过更新
[配置文件](https://github.com/kubernetes/node-problem-detector/blob/v0.8.12/config/custom-plugin-monitor.json)
来定制守护程序行为。

```json
{
"type": "NodeConditionType",
"reason": "CamelCaseDefaultNodeConditionReason",
"message": "arbitrary default node condition message"
}
```
<!--
- A `HealthChecker` type of daemon checks the health of the kubelet and container runtime on a node.
-->
- `HealthChecker` 类型的守护程序检查节点上的 kubelet 和容器运行时的健康状况。

<!--
### Detect new problems
<!--
### Adding support for other log format {#support-other-log-format}

To detect new problems, you can extend the `rules` field in `config/kernel-monitor.json`
with a new rule definition:
The system log monitor currently supports file-based logs, journald, and kmsg.
Additional sources can be added by implementing a new
[log watcher](https://github.com/kubernetes/node-problem-detector/blob/v0.8.12/pkg/systemlogmonitor/logwatchers/types/log_watcher.go).
-->
### 检测新的问题

你可以使用新的规则描述来扩展 `config/kernel-monitor.json` 中的 `rules` 字段以检测新问题:
### 增加对其他日志格式的支持 {#support-other-log-format}

```json
{
"type": "temporary/permanent",
"condition": "NodeConditionOfPermanentIssue",
"reason": "CamelCaseShortReason",
"message": "regexp matching the issue in the kernel log"
}
```
系统日志监视器目前支持基于文件的日志、journald 和 kmsg。
可以通过实现一个新的
[log watcher](https://github.com/kubernetes/node-problem-detector/blob/v0.8.12/pkg/systemlogmonitor/logwatchers/types/log_watcher.go)
来添加额外的日志源。

<!--
### Configure path for the kernel log device {#kernel-log-device-path}
<!--
### Adding custom plugin monitors

Check your kernel log path location in your operating system (OS) distribution.
The Linux kernel [log device](https://www.kernel.org/doc/Documentation/ABI/testing/dev-kmsg) is usually presented as `/dev/kmsg`. However, the log path location varies by OS distribution.
The `log` field in `config/kernel-monitor.json` represents the log path inside the container.
You can configure the `log` field to match the device path as seen by the Node Problem Detector.
You can extend the Node Problem Detector to execute any monitor scripts written in any language by
developing a custom plugin. The monitor scripts must conform to the plugin protocol in exit code
and standard output. For more information, please refer to the
[plugin interface proposal](https://docs.google.com/document/d/1jK_5YloSYtboj-DtfjmYKxfNnUxCAvohLnsH5aGCAYQ/edit#).
-->
### 配置内核日志设备的路径 {#kernel-log-device-path}

检查你的操作系统(OS)发行版本中的内核日志路径位置。
Linux 内核[日志设备](https://www.kernel.org/doc/documentation/abi/testing/dev-kmsg)
通常呈现为 `/dev/kmsg`
但是,日志路径位置因 OS 发行版本而异。
`config/kernel-monitor.json` 中的 `log` 字段表示容器内的日志路径。
你可以配置 `log` 字段以匹配节点问题检测器所示的设备路径。
### 添加自定义插件监视器

<!--
### Add support for another log format {#support-other-log-format}
你可以通过开发自定义插件来扩展节点问题检测器,以执行以任何语言编写的任何监控脚本。
监控脚本必须符合退出码和标准输出的插件协议。
有关更多信息,请参阅
[插件接口提案](https://docs.google.com/document/d/1jK_5YloSYtboj-DtfjmYKxfNnUxCAvohLnsH5aGCAYQ/edit#).

<!--
## Exporter

An exporter reports the node problems and/or metrics to certain backends.
The following exporters are supported:

Kernel monitor uses the
[`Translator`](https://github.com/kubernetes/node-problem-detector/blob/v0.1/pkg/kernelmonitor/translator/translator.go) plugin to translate the internal data structure of the kernel log.
You can implement a new translator for a new log format.
- **Kubernetes exporter**: this exporter reports node problems to the Kubernetes API server.
Temporary problems are reported as Events and permanent problems are reported as Node Conditions.

- **Prometheus exporter**: this exporter reports node problems and metrics locally as Prometheus
(or OpenMetrics) metrics. You can specify the IP address and port for the exporter using command
line arguments.

- **Stackdriver exporter**: this exporter reports node problems and metrics to the Stackdriver
Monitoring API. The exporting behavior can be customized using a
[configuration file](https://github.com/kubernetes/node-problem-detector/blob/v0.8.12/config/exporter/stackdriver-exporter.json).
-->
### 添加对其它日志格式的支持 {#support-other-log-format}

内核监视器使用
[`Translator`](https://github.com/kubernetes/node-problem-detector/blob/v0.1/pkg/kernelmonitor/translator.go)
插件转换内核日志的内部数据结构。
你可以为新的日志格式实现新的转换器。
## 导出器

导出器(Exporter)向特定后端报告节点问题和/或指标。
支持下列导出器:

- **Kubernetes exporter**: 此导出器向 Kubernetes API 服务器报告节点问题。
临时问题报告为事件,永久性问题报告为节点状况。

- **Prometheus exporter**: 此导出器在本地将节点问题和指标报告为 Prometheus(或 OpenMetrics)指标。
你可以使用命令行参数指定导出器的 IP 地址和端口。

- **Stackdriver exporter**: 此导出器向 Stackdriver Monitoring API 报告节点问题和指标。
可以使用[配置文件](https://github.com/kubernetes/node-problem-detector/blob/v0.8.12/config/exporter/stackdriver-exporter.json)自定义导出行为。

<!-- discussion -->

<!--
<!--
## Recommendations and restrictions

It is recommended to run the Node Problem Detector in your cluster to monitor node health.
Expand All @@ -299,7 +324,7 @@ Usually this is fine, because:
* The kernel log grows relatively slowly.
* A resource limit is set for the Node Problem Detector.
* Even under high load, the resource usage is acceptable. For more information, see the Node Problem Detector
[benchmark result](https://github.com/kubernetes/node-problem-detector/issues/2#issuecomment-220255629).
[benchmark result](https://github.com/kubernetes/node-problem-detector/issues/2#issuecomment-220255629).
-->
## 建议和限制

Expand Down

0 comments on commit 929c83e

Please sign in to comment.