Merge pull request #74 from intergral/run_books
docs(runbooks): add docs for runbooks
Umaaz committed Nov 30, 2023
2 parents d3da991 + 9e601c8 commit 8c4f86b
Showing 7 changed files with 111 additions and 1 deletion.
5 changes: 4 additions & 1 deletion CHANGELOG.md
@@ -1,6 +1,9 @@
<!-- 1.0.3 START -->
# 1.0.3 (xx/xx/2023)

- **[CHANGE]**: add docs for run books [#74](https://github.com/intergral/deep/pull/74) [@Umaaz](https://github.com/Umaaz)
- **[CHANGE]**: unify metric namespaces and subsystems [#73](https://github.com/intergral/deep/pull/73) [@Umaaz](https://github.com/Umaaz)
- **[CHANGE]**: unify span tags for tenant [#70](https://github.com/intergral/deep/pull/70) [@Umaaz](https://github.com/Umaaz)
- **[BUGFIX]**: fix port in local docker example [#72](https://github.com/intergral/deep/pull/72) [@Umaaz](https://github.com/Umaaz)
<!-- 1.0.3 END -->

<!-- 1.0.2 START -->
5 changes: 5 additions & 0 deletions docs/docs/_sections/bug_report.md
@@ -0,0 +1,5 @@
# Report Issues

If any errors have been reported in the logs, or you are experiencing strange behaviour, please create an
issue on the [Github](https://github.com/intergral/deep/issues/new/choose) project. This will allow us to improve Deep
and hopefully help you resolve your issues.
29 changes: 29 additions & 0 deletions docs/docs/config/compaction.md
@@ -0,0 +1,29 @@
# Compaction/Retention

Compaction and retention are the methods Deep uses to reduce the number of stored blocks: compaction improves
performance by reducing the block count, and retention removes older data that is no longer needed.

## Compaction

Compaction works by grouping blocks by time frame and combining them, reducing the overall number of blocks that
have to be scanned when performing a query.

Compaction can be configured using the following settings:

| Name | Default | Description |
|---------------------------|---------|----------------------------------------------------------------------------|
| compaction_window | 1h | This is the maximum time range a block should contain. |
| max_compaction_objects | 6000000 | This is the maximum number of Snapshots that will be stored in each block. |
| max_block_bytes | 100 GiB | This is the maximum size in bytes that each block can be. |
| block_retention | 14d | This is the total time a block will be stored for. |
| compacted_block_retention | 1h | This is the duration a compacted block will be stored for. |
| compaction_cycle | 1h | The time between each compaction cycle. |

By modifying these settings you can control how often blocks are compacted, how big they should be, and how much time
they should span. There is no one-size-fits-all config for compaction.
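
In the YAML configuration these settings sit in the compactor's config block. The snippet below is a minimal sketch
only: the setting names and defaults come from the table above, but the `compactor.compaction` nesting and the exact
value syntax for durations and byte sizes are assumptions that should be checked against the configuration reference
for your version of Deep.

```yaml
# Minimal sketch of the compaction settings from the table above. The
# `compactor.compaction` nesting and the value syntax are assumptions; verify
# them against the configuration reference before use.
compactor:
  compaction:
    compaction_window: 1h            # maximum time range a single block should cover
    max_compaction_objects: 6000000  # maximum number of snapshots stored in each block
    max_block_bytes: 107374182400    # 100 GiB expressed as bytes
    block_retention: 336h            # 14 days; how long a block is stored
    compacted_block_retention: 1h    # how long a compacted block is kept before deletion
    compaction_cycle: 1h             # time between compaction cycles
```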

## Retention

Retention is the process of deleting blocks. Once a block has been compacted it is marked for deletion; a deletion
cycle then runs on the configured schedule and scans for marked blocks that are eligible to be deleted. A block is
eligible for deletion once it has been compacted and the `compacted_block_retention` period has expired.
Binary file added docs/docs/images/rb_block_increase_1.png
34 changes: 34 additions & 0 deletions docs/docs/runbooks/block_increase.md
@@ -0,0 +1,34 @@
# Block Increase

## Meaning

This alert indicates that the number of blocks has increased more than expected. It is intended to catch the case
where compaction or retention is not working correctly.

## Impact

If the number of blocks increases steadily for a long enough period, this can impact the performance and cost of
Deep. As the number of blocks increases, the time spent indexing and the cost of performing the indexing will increase.

## Diagnosis

Check the 'Deep/Tenants' dashboard for the number of blocks for the given tenant (the tenant ID should appear on the
alert). This graph should give you insight into the block growth.

A graph like the one below indicates that blocks are not being deleted; the compactor logs should be inspected for
any errors.

![Block Increase](../images/rb_block_increase_1.png)

If there is only a short spike in the block count, this probably means there was a sudden large increase in usage.
The block count should be monitored for further issues.

## Mitigation

If compaction/retention is not working then the compactor should be restarted. This can resolve issues caused by
memory pressure or by failed tasks. Additionally, check the permissions on the storage provider to ensure Deep has
permission to delete data.

It is also advisable to check the [compaction settings](../config/compaction.md) that are being used to ensure they best suit your use case.

{!_sections/bug_report.md!}
37 changes: 37 additions & 0 deletions docs/docs/runbooks/missing_ring_node.md
@@ -0,0 +1,37 @@
# Missing Ring Node

## Meaning

This alert fires when a ring is configured to have _n_ nodes but the actual number of nodes in the ring is either
more or fewer than that number.

## Impact

If the number of nodes remains wrong, the health of the ring can become unstable.

## Diagnosis

The diagnosis of the alert depends on whether there are more or fewer nodes than expected.

### More nodes

If there are more nodes than there should be, this could be due to a failure to shut down a node correctly. This can
lead to an [unhealthy node](./unhealthy_ring_node.md) scenario.

It could also be a sign that the number of replicas has been changed manually to address a scaling issue. Additionally,
it is possible that the `HorizontalPodAutoscaler` has kicked in to address a resource issue. In either scaling case,
the Helm chart config should be updated to reflect the changes if they are to become permanent, so that the alert
config is updated to match the new expected ring size.
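
For example, if the ring has intentionally been scaled from three to five nodes, the new size would be recorded in the
chart values along these lines. This is a hypothetical sketch: the `ingester` key and the values layout are assumptions
about the Helm chart in use.

```yaml
# Hypothetical Helm values excerpt: record the new replica count so the chart
# (and any alert thresholds derived from it) expects the larger ring.
ingester:
  replicas: 5   # matches the number of nodes now expected in the ring
```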

### Fewer nodes

If there are fewer nodes than there should be, this could be due to a failure to start a new node, for example because
of resource starvation on the Kubernetes cluster. The deployment description should be reviewed, and any errors there
addressed. Reviewing the pod logs for any failures during start-up is also advised.
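
If the pod is failing to schedule because of resource starvation, one option is to set resource requests that the
cluster can actually satisfy (or add capacity). The snippet below is a hypothetical sketch: the `ingester` key and the
presence of a per-component `resources` block are assumptions about the Helm chart in use.

```yaml
# Hypothetical Helm values excerpt: request resources the cluster can satisfy
# so the scheduler can place the missing ring node.
ingester:
  resources:
    requests:
      cpu: 500m
      memory: 1Gi
    limits:
      memory: 2Gi
```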

## Mitigation

There is no generic way to correct this issue; it depends on the cause. Using the notes above, identify the error
and look for ways to resolve the root cause.

{!_sections/bug_report.md!}
2 changes: 2 additions & 0 deletions docs/docs/runbooks/unhealthy_ring_node.md
@@ -30,3 +30,5 @@ There is no generic way to correct the failure, it would depend on the reason fo

If the ring is otherwise healthy then you can simply 'Forget' the node by using the action on the appropriate state page
listed above. This will remove the node from the ring and redistribute the tokens.

{!_sections/bug_report.md!}
