Merge pull request #74 from intergral/run_books
docs(runbooks): add docs for runbooks
Umaaz committed Nov 30, 2023
2 parents d3da991 + 9e601c8 commit 8c4f86b
Showing 7 changed files with 111 additions and 1 deletion.
5 changes: 4 additions & 1 deletion CHANGELOG.md
@@ -1,6 +1,9 @@
<!-- 1.0.3 START -->
# 1.0.3 (xx/xx/2023)

- **[CHANGE]**: add docs for run books [#74](https://github.com/intergral/deep/pull/74) [@Umaaz](https://github.com/Umaaz)
- **[CHANGE]**: unify metric namespaces and subsystems [#73](https://github.com/intergral/deep/pull/73) [@Umaaz](https://github.com/Umaaz)
- **[CHANGE]**: unify span tags for tenant [#70](https://github.com/intergral/deep/pull/70) [@Umaaz](https://github.com/Umaaz)
- **[BUGFIX]**: fix port in local docker example [#72](https://github.com/intergral/deep/pull/72) [@Umaaz](https://github.com/Umaaz)
<!-- 1.0.3 END -->

<!-- 1.0.2 START -->
5 changes: 5 additions & 0 deletions docs/docs/_sections/bug_report.md
@@ -0,0 +1,5 @@
# Report Issues

If any errors have been reported in the logs, or you are experiencing strange behaviour, please create an
issue on the [Github](https://github.com/intergral/deep/issues/new/choose) project. This will allow us to improve Deep
and hopefully help you resolve your issues.
29 changes: 29 additions & 0 deletions docs/docs/config/compaction.md
@@ -0,0 +1,29 @@
# Compaction/Retention

Compaction and retention are the methods Deep uses to reduce the number of stored blocks: compaction improves
performance by reducing the block count, and retention removes older data that is no longer needed.

## Compaction

Compaction works by grouping blocks by time frame and combining them, reducing the overall number of blocks that
have to be scanned when performing a query.

Compaction can be configured using the following settings:

| Name | Default | Description |
|---------------------------|---------|----------------------------------------------------------------------------|
| compaction_window | 1h | This is the maximum time range a block should contain. |
| max_compaction_objects | 6000000 | This is the maximum number of Snapshots that will be stored in each block. |
| max_block_bytes | 100 GiB | This is the maximum size in bytes that each block can be. |
| block_retention | 14d | This is the total time a block will be stored for. |
| compacted_block_retention | 1h | This is the duration a compacted block will be stored for. |
| compaction_cycle | 1h | The time between each compaction cycle. |

By modifying these settings you can control how often blocks are compacted, how big they should be, and how much time
they should span. There is no one-size-fits-all config for compaction.
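
In the YAML configuration these settings sit in the compactor's config block. The snippet below is a minimal sketch
only: the setting names and defaults come from the table above, but the `compactor.compaction` nesting and the exact
value syntax for durations and byte sizes are assumptions that should be checked against the configuration reference
for your version of Deep.

```yaml
# Minimal sketch of the compaction settings from the table above. The
# `compactor.compaction` nesting and the value syntax are assumptions; verify
# them against the configuration reference before use.
compactor:
  compaction:
    compaction_window: 1h            # maximum time range a single block should cover
    max_compaction_objects: 6000000  # maximum number of snapshots stored in each block
    max_block_bytes: 107374182400    # 100 GiB expressed as bytes
    block_retention: 336h            # 14 days; how long a block is stored
    compacted_block_retention: 1h    # how long a compacted block is kept before deletion
    compaction_cycle: 1h             # time between compaction cycles
```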

## Retention

Retention is the process of deleting blocks. Once a block has been compacted it is marked for deletion; a deletion
cycle then runs on the configured schedule and scans for marked blocks that are eligible to be deleted. A block is
eligible for deletion once it has been compacted and the `compacted_block_retention` period has expired.
Binary file added docs/docs/images/rb_block_increase_1.png
34 changes: 34 additions & 0 deletions docs/docs/runbooks/block_increase.md
@@ -0,0 +1,34 @@
# Block Increase

## Meaning

This alert indicates that the number of blocks has increased more than expected. It is intended to catch the case
where compaction or retention is not working correctly.

## Impact

If the number of blocks increases steadily for a long enough period, this can impact the performance and cost of
Deep. As the number of blocks increases, the time spent indexing and the cost of performing the indexing will increase.

## Diagnosis

Check the 'Deep/Tenants' dashboard for the number of blocks for the given tenant (the tenant ID should appear on the
alert). This graph should give you insight into the block growth.

A graph like the one below indicates that blocks are not being deleted; the compactor logs should be inspected for
any errors.

![Block Increase](../images/rb_block_increase_1.png)

If there is only a short spike in the block count, this probably means there was a sudden large increase in usage.
The block count should be monitored for further issues.

## Mitigation

If compaction/retention is not working then the compactor should be restarted. This can resolve issues caused by
memory pressure or by failed tasks. Additionally, check the permissions on the storage provider to ensure Deep has
permission to delete data.

It is also advisable to check the [compaction settings](../config/compaction.md) that are being used to ensure they best suit your use case.

{!_sections/bug_report.md!}
37 changes: 37 additions & 0 deletions docs/docs/runbooks/missing_ring_node.md
@@ -0,0 +1,37 @@
# Missing Ring Node

## Meaning

This alert fires when a ring is configured to have _n_ nodes but the actual number of nodes in the ring is either
more or fewer than that number.

## Impact

If the number of nodes remains wrong, the health of the ring can become unstable.

## Diagnosis

The diagnosis of the alert depends on whether there are more or fewer nodes than expected.

### More nodes

If there are more nodes than there should be, this could be due to a failure to shut down a node correctly. This can
lead to an [unhealthy node](./unhealthy_ring_node.md) scenario.

It could also be a sign that the number of replicas has been changed manually to address a scaling issue. Additionally,
it is possible that the `HorizontalPodAutoscaler` has kicked in to address a resource issue. In either scaling case,
the Helm chart config should be updated to reflect the changes if they are to become permanent, so that the alert
config is updated to match the new expected ring size.
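
For example, if the ring has intentionally been scaled from three to five nodes, the new size would be recorded in the
chart values along these lines. This is a hypothetical sketch: the `ingester` key and the values layout are assumptions
about the Helm chart in use.

```yaml
# Hypothetical Helm values excerpt: record the new replica count so the chart
# (and any alert thresholds derived from it) expects the larger ring.
ingester:
  replicas: 5   # matches the number of nodes now expected in the ring
```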

### Fewer nodes

If there are fewer nodes than there should be, this could be due to a failure to start a new node, for example because
of resource starvation on the Kubernetes cluster. The deployment description should be reviewed, and any errors there
addressed. Reviewing the pod logs for any failures during start-up is also advised.
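
If the pod is failing to schedule because of resource starvation, one option is to set resource requests that the
cluster can actually satisfy (or add capacity). The snippet below is a hypothetical sketch: the `ingester` key and the
presence of a per-component `resources` block are assumptions about the Helm chart in use.

```yaml
# Hypothetical Helm values excerpt: request resources the cluster can satisfy
# so the scheduler can place the missing ring node.
ingester:
  resources:
    requests:
      cpu: 500m
      memory: 1Gi
    limits:
      memory: 2Gi
```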

## Mitigation

There is no generic way to correct this issue; it depends on the cause. Using the notes above, identify the error
and look for ways to resolve the root cause.

{!_sections/bug_report.md!}
2 changes: 2 additions & 0 deletions docs/docs/runbooks/unhealthy_ring_node.md
@@ -30,3 +30,5 @@ There is no generic way to correct the failure, it would depend on the reason fo

If the ring is otherwise healthy then you can simply 'Forget' the node by using the action on the appropriate state page
listed above. This will remove the node from the ring and redistribute the tokens.

{!_sections/bug_report.md!}
