Merge pull request #1104 from sun7927/doc
added the documents for audit and failover
sun7927 committed Aug 29, 2023
2 parents 7ee6990 + 513ec58 commit 886f8b0
Showing 4 changed files with 208 additions and 0 deletions.
49 changes: 49 additions & 0 deletions docs/docs/quick_start/advanced_features/fast_failover.md
@@ -0,0 +1,49 @@
---
sidebar_position: 6
sidebar_label: "Fast Failover"
---

# Application Fast Failover

When a stateful application (i.e. a Pod with a HwameiStor volume) runs into a problem, especially one caused by a node issue,
it's important to reschedule the Pod to another healthy node so that it keeps running.

However, due to the design of Kubernetes StatefulSets and Deployments,
Kubernetes waits a long time (e.g. 5 mins) before rescheduling the Pod.
In particular, it never reschedules a StatefulSet Pod automatically.
This stops the application and can even cause a huge business loss.

HwameiStor provides a fast failover feature to solve this problem. When it identifies an application issue,
it reschedules the Pod immediately instead of waiting for a long time.
HwameiStor fails the Pod over to another healthy node and ensures that the required data volumes are also located on that node,
so the application can continue to work.

# How to use

HwameiStor provides fast failover for two cases:

* Node Failure

When a node fails, none of the Pods on it can work any more. A Pod that uses a HwameiStor volume
must be rescheduled to another healthy node that holds a replica of the associated data volume.
Users can trigger the fast failover for this node as follows:
```
# Add a label to this node:
kubectl label node <nodeName> hwameistor.io/failover=start
# When the fast failover completes, the label is changed to:
#   hwameistor.io/failover=completed
```

* Pod Failure

When a Pod fails, users can trigger the fast failover for it as follows:
```
# Add a label to this Pod:
kubectl label pod <podName> hwameistor.io/failover=start
# When the fast failover completes, the old Pod is deleted and a new one is created on a new node.
```
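
The two triggers above can also be scripted. The following is a minimal sketch, not part of HwameiStor: it assumes `kubectl` is on the `PATH` and configured for the cluster, and the helper names (`start_failover`, `failover_state`, `wait_for_node_failover`), the optional `namespace` argument, and the timeout values are all hypothetical. It applies the `hwameistor.io/failover=start` label and, for a node, polls until the label reads `completed`.

```python
import json
import subprocess
import time

FAILOVER_LABEL = "hwameistor.io/failover"


def kubectl(args):
    """Run kubectl with the given arguments and return its stdout."""
    result = subprocess.run(
        ["kubectl", *args], check=True, capture_output=True, text=True
    )
    return result.stdout


def start_failover(kind, name, namespace=None, run=kubectl):
    """Label a node or Pod with hwameistor.io/failover=start."""
    if kind not in ("node", "pod"):
        raise ValueError("kind must be 'node' or 'pod'")
    args = ["label", kind, name, f"{FAILOVER_LABEL}=start"]
    if namespace is not None:
        # Only meaningful for Pods; nodes are cluster-scoped.
        args += ["-n", namespace]
    run(args)


def failover_state(node, run=kubectl):
    """Return the current value of the failover label on a node, or None."""
    obj = json.loads(run(["get", "node", node, "-o", "json"]))
    return obj["metadata"].get("labels", {}).get(FAILOVER_LABEL)


def wait_for_node_failover(node, timeout=300.0, interval=5.0, run=kubectl):
    """Poll the node until the label reads "completed"; return False on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        if failover_state(node, run) == "completed":
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

A typical call would be `start_failover("node", "k8s-node1")` followed by `wait_for_node_failover("k8s-node1")`. The `run` parameter exists only so the helpers can be exercised without a cluster.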
58 changes: 58 additions & 0 deletions docs/docs/quick_start/advanced_features/system_audit.md
@@ -0,0 +1,58 @@
---
sidebar_position: 7
sidebar_label: "System Audit"
---

# System Audit

It's important to record the system's operation history. HwameiStor provides an audit feature that records the operations on all system resources, including Cluster, Node, StoragePool, Volume, etc.

The audit information is easy for users to understand and parse for various purposes.

# How to use

HwameiStor defines a new CRD and creates one CR for every resource to record its operation history, as below:

```yaml
apiVersion: hwameistor.io/v1alpha1
kind: Event
metadata:
  name:
spec:
  resourceType: <Cluster | Node | StoragePool | Volume>
  resourceName:
  records:
    - action:
      actionContent: # in JSON format
      time:
      state:
      stateContent: # in JSON format
```

For instance, let's look at the audit information of a volume:

```yaml
apiVersion: hwameistor.io/v1alpha1
kind: Event
metadata:
  creationTimestamp: "2023-08-08T15:52:55Z"
  generation: 5
  name: volume-pvc-34e3b086-2d95-4980-beb6-e175fd79a847
  resourceVersion: "10221888"
  uid: d3ebaffb-eddb-4c84-93be-efff350688af
spec:
  resourceType: Volume
  resourceName: pvc-34e3b086-2d95-4980-beb6-e175fd79a847
  records:
    - action: Create
      actionContent: '{"requiredCapacityBytes":5368709120,"volumeQoS":{},"poolName":"LocalStorage_PoolHDD","replicaNumber":2,"convertible":true,"accessibility":{"nodes":["k8s-node1","k8s-master"],"zones":["default"],"regions":["default"]},"pvcNamespace":"default","pvcName":"mysql-data-volume","volumegroup":"db890e34-a092-49ac-872b-f2a422439c81"}'
      time: "2023-08-08T15:52:55Z"
    - action: Mount
      actionContent: '{"allocatedCapacityBytes":5368709120,"replicas":["pvc-34e3b086-2d95-4980-beb6-e175fd79a847-krp927","pvc-34e3b086-2d95-4980-beb6-e175fd79a847-wm7p56"],"state":"Ready","publishedNode":"k8s-node1","fsType":"xfs","rawblock":false}'
      time: "2023-08-08T15:53:07Z"
    - action: Unmount
      actionContent: '{"allocatedCapacityBytes":5368709120,"usedCapacityBytes":33783808,"totalInode":2621120,"usedInode":3,"replicas":["pvc-34e3b086-2d95-4980-beb6-e175fd79a847-krp927","pvc-34e3b086-2d95-4980-beb6-e175fd79a847-wm7p56"],"state":"Ready","publishedNode":"k8s-node1","fsType":"xfs","rawblock":false}'
      time: "2023-08-08T16:03:03Z"
    - action: Delete
      actionContent: '{"requiredCapacityBytes":5368709120,"volumeQoS":{},"poolName":"LocalStorage_PoolHDD","replicaNumber":2,"convertible":true,"accessibility":{"nodes":["k8s-node1","k8s-master"],"zones":["default"],"regions":["default"]},"pvcNamespace":"default","pvcName":"mysql-data-volume","volumegroup":"db890e34-a092-49ac-872b-f2a422439c81","config":{"version":1,"volumeName":"pvc-34e3b086-2d95-4980-beb6-e175fd79a847","requiredCapacityBytes":5368709120,"convertible":true,"resourceID":2,"readyToInitialize":true,"initialized":true,"replicas":[{"id":1,"hostname":"k8s-node1","ip":"10.6.113.101","primary":true},{"id":2,"hostname":"k8s-master","ip":"10.6.113.100","primary":false}]},"delete":true}'
      time: "2023-08-08T16:03:38Z"
```
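
Because `actionContent` is a JSON string embedded in the record, a consumer has to decode it a second time after loading the Event itself. The sketch below illustrates that step only. It assumes the Event has already been loaded into a Python dict, that `records` nests under `spec`, and it uses two shortened records from the volume example above; the function name `decode_records` is hypothetical.

```python
import json

# Two records copied (and shortened) from the Volume example above.
event = {
    "spec": {
        "resourceType": "Volume",
        "resourceName": "pvc-34e3b086-2d95-4980-beb6-e175fd79a847",
        "records": [
            {
                "action": "Create",
                "actionContent": '{"requiredCapacityBytes":5368709120,'
                '"poolName":"LocalStorage_PoolHDD","replicaNumber":2}',
                "time": "2023-08-08T15:52:55Z",
            },
            {
                "action": "Mount",
                "actionContent": '{"allocatedCapacityBytes":5368709120,'
                '"state":"Ready","publishedNode":"k8s-node1","fsType":"xfs"}',
                "time": "2023-08-08T15:53:07Z",
            },
        ],
    }
}


def decode_records(event):
    """Return (time, action, content) tuples with actionContent parsed as JSON."""
    decoded = []
    for record in event["spec"].get("records", []):
        # Treat a missing or empty actionContent as an empty JSON object.
        raw = record.get("actionContent") or "{}"
        decoded.append((record["time"], record["action"], json.loads(raw)))
    return decoded


for when, action, content in decode_records(event):
    # Print each record's time, action, and the keys found in its content.
    print(when, action, sorted(content))
```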
@@ -0,0 +1,45 @@
---
sidebar_position: 6
sidebar_label: "Application Failover"
---

# Fast Failover

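For stateful applications in Kubernetes (Pods that mount a HwameiStor PVC), when the Pod or the PVC runs into a problem, especially when the Kubernetes node fails,
the problem must be detected promptly and the Pod rescheduled to another healthy node where it can mount the PVC successfully.
Because of limitations in the Kubernetes scheduling mechanism, it takes a fairly long time (e.g. 5 minutes) to determine that the Pod can be rescheduled.
In addition, because the Pod mounts a PVC, a further long wait (e.g. 6 minutes) is required.
Kubernetes does not reschedule a StatefulSet Pod at all, whereas a Deployment Pod can be rescheduled.
As a result, the application is interrupted for a long time and cannot keep serving the business.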

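To address this kind of failure, HwameiStor provides a fast failover capability for applications.
When an application failure is detected, HwameiStor reschedules the application to another healthy node within a short time and ensures that the new node holds a replica of the data volume the application needs, so that the business application keeps running normally.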

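# How to use

HwameiStor provides the fast failover mechanism for two cases:

* Node failure

In this case, none of the applications on the node can run normally. For applications that use HwameiStor data volumes, the Pods must be rescheduled promptly to a new healthy node.
Users can perform the failover as follows: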
```
# Add a label to the node:
kubectl label node <nodeName> hwameistor.io/failover=start
# When the failover completes, the label above changes to:
#   hwameistor.io/failover=completed
```

* Application Pod failure

In this case, users can perform the failover for the Pod as follows:
```
# Add a label to the Pod:
kubectl label pod <podName> hwameistor.io/failover=start
# When the failover completes, the old Pod is deleted and a new Pod starts and runs normally on a new node.
```
@@ -0,0 +1,56 @@
---
sidebar_position: 7
sidebar_label: "System Audit Log"
---

# Audit Log

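To record the usage and operation history of a HwameiStor cluster, HwameiStor provides a system audit log. The audit log carries HwameiStor system semantics and is easy for users to read and parse.
The audit log records usage and operation information for every type of resource in the HwameiStor system, including Cluster, Node, StoragePool, Volume, and so on.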

# How to use

The audit log is stored in the system as a CRD, and one CR is created for each resource to record its operation history. The CRD is as follows:

```yaml
apiVersion: hwameistor.io/v1alpha1
kind: Event
metadata:
  name:
spec:
  resourceType: <Cluster | Node | StoragePool | Volume>
  resourceName:
  records:
    - action:
      actionContent: # in JSON format
      time:
      state:
      stateContent: # in JSON format
```

```yaml
apiVersion: hwameistor.io/v1alpha1
kind: Event
metadata:
  creationTimestamp: "2023-08-08T15:52:55Z"
  generation: 5
  name: volume-pvc-34e3b086-2d95-4980-beb6-e175fd79a847
  resourceVersion: "10221888"
  uid: d3ebaffb-eddb-4c84-93be-efff350688af
spec:
  resourceType: Volume
  resourceName: pvc-34e3b086-2d95-4980-beb6-e175fd79a847
  records:
    - action: Create
      actionContent: '{"requiredCapacityBytes":5368709120,"volumeQoS":{},"poolName":"LocalStorage_PoolHDD","replicaNumber":2,"convertible":true,"accessibility":{"nodes":["k8s-node1","k8s-master"],"zones":["default"],"regions":["default"]},"pvcNamespace":"default","pvcName":"mysql-data-volume","volumegroup":"db890e34-a092-49ac-872b-f2a422439c81"}'
      time: "2023-08-08T15:52:55Z"
    - action: Mount
      actionContent: '{"allocatedCapacityBytes":5368709120,"replicas":["pvc-34e3b086-2d95-4980-beb6-e175fd79a847-krp927","pvc-34e3b086-2d95-4980-beb6-e175fd79a847-wm7p56"],"state":"Ready","publishedNode":"k8s-node1","fsType":"xfs","rawblock":false}'
      time: "2023-08-08T15:53:07Z"
    - action: Unmount
      actionContent: '{"allocatedCapacityBytes":5368709120,"usedCapacityBytes":33783808,"totalInode":2621120,"usedInode":3,"replicas":["pvc-34e3b086-2d95-4980-beb6-e175fd79a847-krp927","pvc-34e3b086-2d95-4980-beb6-e175fd79a847-wm7p56"],"state":"Ready","publishedNode":"k8s-node1","fsType":"xfs","rawblock":false}'
      time: "2023-08-08T16:03:03Z"
    - action: Delete
      actionContent: '{"requiredCapacityBytes":5368709120,"volumeQoS":{},"poolName":"LocalStorage_PoolHDD","replicaNumber":2,"convertible":true,"accessibility":{"nodes":["k8s-node1","k8s-master"],"zones":["default"],"regions":["default"]},"pvcNamespace":"default","pvcName":"mysql-data-volume","volumegroup":"db890e34-a092-49ac-872b-f2a422439c81","config":{"version":1,"volumeName":"pvc-34e3b086-2d95-4980-beb6-e175fd79a847","requiredCapacityBytes":5368709120,"convertible":true,"resourceID":2,"readyToInitialize":true,"initialized":true,"replicas":[{"id":1,"hostname":"k8s-node1","ip":"10.6.113.101","primary":true},{"id":2,"hostname":"k8s-master","ip":"10.6.113.100","primary":false}]},"delete":true}'
      time: "2023-08-08T16:03:38Z"
```
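
The `time` field of each record makes it possible to reconstruct a resource's timeline. As a small illustration, and only a sketch, the snippet below takes the four action/time pairs from the example above and computes how long the volume stayed mounted. The helper names `parse_time` and `seconds_between` are hypothetical, and the timestamp format is assumed to be the UTC form shown in the example.

```python
from datetime import datetime

# (action, time) pairs taken from the Volume example above.
records = [
    ("Create", "2023-08-08T15:52:55Z"),
    ("Mount", "2023-08-08T15:53:07Z"),
    ("Unmount", "2023-08-08T16:03:03Z"),
    ("Delete", "2023-08-08T16:03:38Z"),
]


def parse_time(value):
    """Parse the UTC timestamp format used in the audit records."""
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ")


def seconds_between(records, first, second):
    """Seconds from the first `first` action to the next `second` action, or None."""
    start = None
    for action, when in records:
        if action == first and start is None:
            start = parse_time(when)
        elif action == second and start is not None:
            return (parse_time(when) - start).total_seconds()
    return None


# Time the volume spent mounted, in seconds.
print(seconds_between(records, "Mount", "Unmount"))
```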
