Merge pull request #1104 from sun7927/doc

added the documents for audit and failover
hwameistor · Aug 29, 2023 · 886f8b0 · 886f8b0
2 parents 7ee6990 + 513ec58
commit 886f8b0

Showing 4 changed files with 208 additions and 0 deletions.
diff --git a/docs/docs/quick_start/advanced_features/fast_failover.md b/docs/docs/quick_start/advanced_features/fast_failover.md
@@ -0,0 +1,49 @@
+---
+sidebar_position: 6
+sidebar_label: "Fast Failover"
+---
+
+# Application Fast Failover
+
+When a stateful application (i.e. a Pod with a HwameiStor volume) runs into a problem, especially one caused by a node issue,
+it's important to reschedule the Pod to another healthy node so that it keeps running.
+
+However, due to the design of Kubernetes StatefulSets and Deployments,
+Kubernetes waits a long time (e.g. 5 minutes) before rescheduling the Pod.
+Worse, it never automatically reschedules a Pod that belongs to a StatefulSet.
+This stops the application and can even cause significant business loss.
+
+HwameiStor provides a fast failover feature to solve this problem. When an application issue is identified,
+it reschedules the Pod immediately instead of waiting out the long delay.
+HwameiStor fails the Pod over to another healthy node and ensures that the required data volumes are also located on that node,
+so the application can continue to work.
+
+# How to use
+
+HwameiStor provides fast failover for the following two cases:
+
+* Node Failure  
+
+  When a node fails, none of the Pods on it can work any more. A Pod that uses a HwameiStor volume
+  must be rescheduled to another healthy node that holds a replica of the associated data volume.
+  The user can trigger fast failover for this node as follows:
+  ```
+  Add a label to this node:
+
+  kubectl label node <nodeName> hwameistor.io/failover=start
+
+  When the fast failover completes, the label is changed to:
+
+  hwameistor.io/failover=completed
+  ```
+
+* Pod Failure
+
+  When a Pod fails, the user can trigger fast failover for it as follows:
+  ```
+  Add a label to this Pod:
+
+  kubectl label pod <podName> hwameistor.io/failover=start
+
+  When the fast failover completes, the old Pod is deleted and a new one is created on a new node.
+  ```
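
The two label-based triggers above can also be scripted. The following is a minimal Python sketch, not part of HwameiStor or of this commit. It assumes `kubectl` is on the PATH and configured for the cluster. The helper names (`label_command`, `failover_state`, `trigger_node_failover`), the use of `--overwrite`, and the timeout values are illustrative assumptions; only the label key and the `start` / `completed` values come from the document.

```python
import json
import subprocess
import time

FAILOVER_LABEL = "hwameistor.io/failover"  # label key taken from the doc above


def label_command(kind, name, namespace=None):
    """Build the kubectl argv that requests failover for a node or a Pod."""
    if kind not in ("node", "pod"):
        raise ValueError("kind must be 'node' or 'pod'")
    # --overwrite is an assumption: it lets the label be set again after an
    # earlier run has left the value at "completed".
    cmd = ["kubectl", "label", kind, name, f"{FAILOVER_LABEL}=start", "--overwrite"]
    if kind == "pod" and namespace:
        cmd += ["--namespace", namespace]
    return cmd


def failover_state(resource_json):
    """Return the failover label value from `kubectl get ... -o json` output, or None."""
    labels = json.loads(resource_json).get("metadata", {}).get("labels") or {}
    return labels.get(FAILOVER_LABEL)


def trigger_node_failover(node, timeout_s=300, poll_s=5):
    """Label the node, then poll until the label reads "completed" (hypothetical helper)."""
    subprocess.run(label_command("node", node), check=True)
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        out = subprocess.run(
            ["kubectl", "get", "node", node, "-o", "json"],
            check=True, capture_output=True, text=True,
        ).stdout
        if failover_state(out) == "completed":
            return True
        time.sleep(poll_s)
    return False
```

Polling is shown only for the node case. In the Pod case the labeled Pod is deleted when failover completes, so there is no `completed` label left to wait for.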
diff --git a/docs/docs/quick_start/advanced_features/system_audit.md b/docs/docs/quick_start/advanced_features/system_audit.md
@@ -0,0 +1,58 @@
+---
+sidebar_position: 7
+sidebar_label: "System Audit"
+---
+
+# System Audit
+
+It's important to record the history of system operations. HwameiStor provides an audit feature that records the operations on all system resources, including Cluster, Node, StoragePool, Volume, etc.
+
+The audit information is easy for users to understand and parse for various purposes.
+
+# How to use
+
+HwameiStor defines a new CRD for auditing and creates one CR per resource to record its operation history. The CRD is as follows:
+
+```yaml
+apiVersion: hwameistor.io/v1alpha1
+kind: Event
+metadata:
+  name: 
+spec:
+  resourceType: <Cluster | Node | StoragePool | Volume>
+  resourceName:
+  records:
+  - action:
+    actionContent: #in JSON format
+    time:
+    state:
+    stateContent: #in JSON format
+```
+
+For instance, here is the audit information of a volume:
+
+```yaml
+apiVersion: hwameistor.io/v1alpha1
+kind: Event
+metadata:
+  creationTimestamp: "2023-08-08T15:52:55Z"
+  generation: 5
+  name: volume-pvc-34e3b086-2d95-4980-beb6-e175fd79a847
+  resourceVersion: "10221888"
+  uid: d3ebaffb-eddb-4c84-93be-efff350688af
+spec:
+  resourceType: Volume
+  resourceName: pvc-34e3b086-2d95-4980-beb6-e175fd79a847
+  records:
+  - action: Create
+    actionContent: '{"requiredCapacityBytes":5368709120,"volumeQoS":{},"poolName":"LocalStorage_PoolHDD","replicaNumber":2,"convertible":true,"accessibility":{"nodes":["k8s-node1","k8s-master"],"zones":["default"],"regions":["default"]},"pvcNamespace":"default","pvcName":"mysql-data-volume","volumegroup":"db890e34-a092-49ac-872b-f2a422439c81"}'
+    time: "2023-08-08T15:52:55Z"
+  - action: Mount
+    actionContent: '{"allocatedCapacityBytes":5368709120,"replicas":["pvc-34e3b086-2d95-4980-beb6-e175fd79a847-krp927","pvc-34e3b086-2d95-4980-beb6-e175fd79a847-wm7p56"],"state":"Ready","publishedNode":"k8s-node1","fsType":"xfs","rawblock":false}'
+    time: "2023-08-08T15:53:07Z"
+  - action: Unmount
+    actionContent: '{"allocatedCapacityBytes":5368709120,"usedCapacityBytes":33783808,"totalInode":2621120,"usedInode":3,"replicas":["pvc-34e3b086-2d95-4980-beb6-e175fd79a847-krp927","pvc-34e3b086-2d95-4980-beb6-e175fd79a847-wm7p56"],"state":"Ready","publishedNode":"k8s-node1","fsType":"xfs","rawblock":false}'
+    time: "2023-08-08T16:03:03Z"
+  - action: Delete
+    actionContent: '{"requiredCapacityBytes":5368709120,"volumeQoS":{},"poolName":"LocalStorage_PoolHDD","replicaNumber":2,"convertible":true,"accessibility":{"nodes":["k8s-node1","k8s-master"],"zones":["default"],"regions":["default"]},"pvcNamespace":"default","pvcName":"mysql-data-volume","volumegroup":"db890e34-a092-49ac-872b-f2a422439c81","config":{"version":1,"volumeName":"pvc-34e3b086-2d95-4980-beb6-e175fd79a847","requiredCapacityBytes":5368709120,"convertible":true,"resourceID":2,"readyToInitialize":true,"initialized":true,"replicas":[{"id":1,"hostname":"k8s-node1","ip":"10.6.113.101","primary":true},{"id":2,"hostname":"k8s-master","ip":"10.6.113.100","primary":false}]},"delete":true}'
+    time: "2023-08-08T16:03:38Z"
+```
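
Because each record stores its `actionContent` as a JSON string, the audit trail of a resource can be decoded with a few lines of code. The following is a minimal Python sketch, not part of HwameiStor or of this commit. It assumes the Event CR has already been loaded into a dict (for example from `-o json` output); the function name `decode_records` is hypothetical, and the sample data is a trimmed copy of the volume example above.

```python
import json
from datetime import datetime


def decode_records(event):
    """Return a list of (time, action, content) tuples for one audit Event."""
    decoded = []
    for rec in event["spec"].get("records", []):
        raw = rec.get("actionContent")
        content = json.loads(raw) if raw else {}
        # Timestamps in the example are UTC in the form 2023-08-08T15:52:55Z.
        when = datetime.strptime(rec["time"], "%Y-%m-%dT%H:%M:%SZ")
        decoded.append((when, rec["action"], content))
    return decoded


# Trimmed copy of the volume example; most actionContent keys are omitted.
event = {
    "spec": {
        "resourceType": "Volume",
        "resourceName": "pvc-34e3b086-2d95-4980-beb6-e175fd79a847",
        "records": [
            {"action": "Create",
             "actionContent": '{"requiredCapacityBytes":5368709120,"replicaNumber":2}',
             "time": "2023-08-08T15:52:55Z"},
            {"action": "Mount",
             "actionContent": '{"state":"Ready","publishedNode":"k8s-node1","fsType":"xfs"}',
             "time": "2023-08-08T15:53:07Z"},
        ],
    }
}

records = decode_records(event)
for when, action, content in records:
    print(when.isoformat(), action, sorted(content))
```

With the records decoded, questions such as which node a volume was published on, or how long it took to go from Create to Mount, become plain dictionary lookups and timestamp arithmetic.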
diff --git a/...urus-plugin-content-docs/current/quick_start/advanced_features/fast_failover.md b/...urus-plugin-content-docs/current/quick_start/advanced_features/fast_failover.md
@@ -0,0 +1,45 @@
+---
+sidebar_position: 6
+sidebar_label: "Application Failover"
+---
+
+# Fast Failover
+
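+For a stateful application in Kubernetes (a Pod that mounts a HwameiStor PVC), when the Pod or the PVC runs into a problem, especially one caused by a Kubernetes node failure,
+the problem must be detected promptly and the Pod rescheduled to another healthy node where it can mount the PVC successfully.
+Because of limits in the Kubernetes scheduling mechanism, Kubernetes first waits a fairly long time (e.g. 5 minutes) before deciding that the Pod can be rescheduled.
+In addition, because the Pod mounts a PVC, a further long wait (e.g. 6 minutes) is required.
+A StatefulSet Pod is not rescheduled by Kubernetes at all, whereas a Deployment Pod is.
+As a result, the application is interrupted for a long time and cannot continue serving the business.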
+
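+To handle this kind of failure, HwameiStor provides a fast failover capability for applications.
+When an application failure is detected, it reschedules the application to another healthy node within a short time and ensures that a replica of the required data volume is available on the new node, so the application keeps running normally.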
+
+# How to use
+
+HwameiStor provides fast failover for two cases:
+
+* Node failure
+
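+  In this case, none of the applications on the node can run normally. An application that uses a HwameiStor data volume needs its Pod rescheduled promptly to a new healthy node.
+  The user can trigger failover as follows: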
+  ```
+  Add a label to this node:
+
+  kubectl label node <nodeName> hwameistor.io/failover=start
+
+  When the failover completes, the label above changes to:
+
+  hwameistor.io/failover=completed
+  ```
+
+* Application Pod failure
+
+  In this case, the user can trigger failover for the Pod as follows:
+  ```
+  Add a label to this Pod:
+
+  kubectl label pod <podName> hwameistor.io/failover=start
+
+  When the failover completes, the old Pod is deleted and a new Pod starts and runs normally on a new node.
+  ```
diff --git a/...aurus-plugin-content-docs/current/quick_start/advanced_features/system_audit.md b/...aurus-plugin-content-docs/current/quick_start/advanced_features/system_audit.md
@@ -0,0 +1,56 @@
+---
+sidebar_position: 7
+sidebar_label: "System Audit Log"
+---
+
+# Audit Log
+
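+To record the usage and operation history of a HwameiStor cluster, HwameiStor provides a system audit log. The audit log carries HwameiStor system semantics and is easy for users to read and parse.
+The audit log records the usage and operations on each type of resource in the HwameiStor system, including Cluster, Node, StoragePool, Volume, and so on.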
+
+# 使用方式
+
+The audit log is stored in the system through a CRD: one CR is created for each resource to record its operation history. The CRD is as follows:
+
+```yaml
+apiVersion: hwameistor.io/v1alpha1
+kind: Event
+metadata:
+  name: 
+spec:
+  resourceType: <Cluster | Node | StoragePool | Volume>
+  resourceName:
+  records:
+  - action:
+    actionContent: #in JSON format
+    time:
+    state:
+    stateContent: #in JSON format
+```
+
+```yaml
+apiVersion: hwameistor.io/v1alpha1
+kind: Event
+metadata:
+  creationTimestamp: "2023-08-08T15:52:55Z"
+  generation: 5
+  name: volume-pvc-34e3b086-2d95-4980-beb6-e175fd79a847
+  resourceVersion: "10221888"
+  uid: d3ebaffb-eddb-4c84-93be-efff350688af
+spec:
+  resourceType: Volume
+  resourceName: pvc-34e3b086-2d95-4980-beb6-e175fd79a847
+  records:
+  - action: Create
+    actionContent: '{"requiredCapacityBytes":5368709120,"volumeQoS":{},"poolName":"LocalStorage_PoolHDD","replicaNumber":2,"convertible":true,"accessibility":{"nodes":["k8s-node1","k8s-master"],"zones":["default"],"regions":["default"]},"pvcNamespace":"default","pvcName":"mysql-data-volume","volumegroup":"db890e34-a092-49ac-872b-f2a422439c81"}'
+    time: "2023-08-08T15:52:55Z"
+  - action: Mount
+    actionContent: '{"allocatedCapacityBytes":5368709120,"replicas":["pvc-34e3b086-2d95-4980-beb6-e175fd79a847-krp927","pvc-34e3b086-2d95-4980-beb6-e175fd79a847-wm7p56"],"state":"Ready","publishedNode":"k8s-node1","fsType":"xfs","rawblock":false}'
+    time: "2023-08-08T15:53:07Z"
+  - action: Unmount
+    actionContent: '{"allocatedCapacityBytes":5368709120,"usedCapacityBytes":33783808,"totalInode":2621120,"usedInode":3,"replicas":["pvc-34e3b086-2d95-4980-beb6-e175fd79a847-krp927","pvc-34e3b086-2d95-4980-beb6-e175fd79a847-wm7p56"],"state":"Ready","publishedNode":"k8s-node1","fsType":"xfs","rawblock":false}'
+    time: "2023-08-08T16:03:03Z"
+  - action: Delete
+    actionContent: '{"requiredCapacityBytes":5368709120,"volumeQoS":{},"poolName":"LocalStorage_PoolHDD","replicaNumber":2,"convertible":true,"accessibility":{"nodes":["k8s-node1","k8s-master"],"zones":["default"],"regions":["default"]},"pvcNamespace":"default","pvcName":"mysql-data-volume","volumegroup":"db890e34-a092-49ac-872b-f2a422439c81","config":{"version":1,"volumeName":"pvc-34e3b086-2d95-4980-beb6-e175fd79a847","requiredCapacityBytes":5368709120,"convertible":true,"resourceID":2,"readyToInitialize":true,"initialized":true,"replicas":[{"id":1,"hostname":"k8s-node1","ip":"10.6.113.101","primary":true},{"id":2,"hostname":"k8s-master","ip":"10.6.113.100","primary":false}]},"delete":true}'
+    time: "2023-08-08T16:03:38Z"
+```