docs: Apply markdownlint rules in the Telemetry repository #916

Merged: 1 commit on Mar 25, 2024
28 changes: 28 additions & 0 deletions .markdownlint.yaml
@@ -0,0 +1,28 @@
# This is a configuration file for markdownlint. You can use this file to override the default settings.
# MD013 is set to false by default because many files include lines longer than the conventional 80-character limit
MD013: false
# MD024 is set to false to allow multiple headings with the same content
MD024: false
# MD029 is set to false because it generated some issues with longer lists
MD029: false
# MD033 is set to false to allow inline HTML elements
MD033: false
# MD044 enforces the capitalization of particular words. You can determine whether it also applies to code blocks and HTML elements
MD044:
code_blocks: false
html_elements: false
names:
- Kyma
- Kubernetes
- ConfigMap
- CronJob
- CustomResourceDefinition
- Ingress
- Node
- PodPreset
- Pod
- ProwJob
- Secret
- ServiceBinding
- ServiceClass
- ServiceInstance
1 change: 1 addition & 0 deletions .markdownlintignore
@@ -0,0 +1 @@
_sidebar.md
6 changes: 5 additions & 1 deletion CODEOWNERS
@@ -4,4 +4,8 @@
/docs/ @kyma-project/technical-writers

# All .md files
*.md @kyma-project/technical-writers

# Config files for markdownlint
.markdownlint.yaml @kyma-project/technical-writers
markdown_heading_capitalization.js @kyma-project/technical-writers
@@ -1,4 +1,4 @@
# 1. Trace/Metric Pipeline status based on OTel Collector metrics
# 1. Trace/Metric Pipeline Status Based on OTel Collector Metrics

Date: 2023-08-11

@@ -9,8 +9,8 @@ Proposed
## Context

The telemetry pipelines deploy active components to a cluster, and these components are responsible for dispatching data to third-party services.
Inevitably, interruptions can occur, potentially hindering the successful delivery of data.
To enhance short-term resilience, we implement retries and buffering mechanisms.
Nevertheless, there may be scenarios where data cannot be delivered, requiring user notification for prompt action.

The typical situations causing problems are:
@@ -19,8 +19,8 @@ The typical situations causing problems are:
* Backpressure or throttling due to overloading the third-party service.
* Reaching the ingestion limit of a pipeline, which may occur even with auto-scaling in place (subject to a maximum replica setting).

These situations can be monitored by collecting metrics from the relevant components, with [recently added documentation](https://github.com/kyma-project/telemetry-manager/pull/423) to guide the process.
However, this approach can be cumbersome for users, as it requires them to identify which services to scrape and filter the relevant metrics.
Additionally, these details delve into internal aspects that may undergo changes in the future.

### Goal
@@ -61,20 +61,23 @@ Clear communication regarding its internal nature is crucial to avoid any misuse

### Integrating Prometheus Querying into Telemetry Manager Reconciliation Loop

#### Direct Prometheus Queries:
#### Direct Prometheus Queries

* Telemetry Manager manages a set of predefined PromQL queries and communicates with Prometheus using [expression queries API](https://prometheus.io/docs/prometheus/latest/querying/api/#expression-queries) to collect data points.
Subsequently, it utilizes this data to evaluate the status of Pipeline CRs.
* However, it's crucial to note that if the querying process takes an extended duration, it may impact the reconciliation of TracePipeline and MetricPipeline.
![Prometheus Integration using Direct Queries](../assets/prom-integration-direct-queries-flow.svg "Prometheus Integration using Direct Queries")
* Using [recording rules](https://prometheus.io/docs/prometheus/latest/configuration/recording_rules/) can improve query performance.
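
As a rough illustration of how recording rules could cut query cost, a rule file of the following shape could precompute per-exporter failure rates so that Telemetry Manager only needs cheap instant queries (the metric names follow the OTel Collector's standard self-monitoring metrics; the group and rule names are assumptions):

```yaml
# Illustrative Prometheus recording rules: precompute the per-exporter failure
# rate so that status evaluation only needs a cheap instant query.
groups:
  - name: telemetry-pipeline-health
    interval: 30s
    rules:
      - record: telemetry:otelcol_exporter_send_failed_spans:rate5m
        expr: sum by (exporter) (rate(otelcol_exporter_send_failed_spans[5m]))
      - record: telemetry:otelcol_exporter_enqueue_failed_spans:rate5m
        expr: sum by (exporter) (rate(otelcol_exporter_enqueue_failed_spans[5m]))
```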

#### Prometheus Queries with Additional Controller:
#### Prometheus Queries with Additional Controller

* Similar to [Direct Prometheus Queries](#direct-prometheus-queries), but it addresses the potential challenge of long-running queries (which could hinder reconciliation) by introducing a separate PipelineMonitoring CR with its own reconciliation loop (sketched below).
This separation effectively isolates the query execution and result storage from the TracePipeline and MetricPipeline controllers.
It's worth noting that while this approach enhances efficiency, it can introduce increased complexity to the overall setup.
![Prometheus Integration with Additional Controller](../assets/prom-integration-extra-ctrl-flow.svg "Prometheus Integration with Additional Controller")
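
Purely as an illustration of how such a separation might surface in the API (the kind, fields, and values below are invented for this sketch and are not part of the proposal), the intermediate CR could carry the query results that the pipeline controllers then read during their own reconciliation:

```yaml
# Hypothetical PipelineMonitoring resource; every field shown here is
# illustrative only and not an agreed-upon API.
apiVersion: telemetry.kyma-project.io/v1alpha1
kind: PipelineMonitoring
metadata:
  name: default
status:
  lastEvaluationTime: "2023-08-11T10:00:00Z"
  results:
    - query: exporter-send-failed-spans-rate
      pipeline: my-trace-pipeline
      healthy: true
```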

#### Prometheus Alerts:
#### Prometheus Alerts

* In this approach, PromQL expressions are defined as Prometheus alerts, with Prometheus responsible for their evaluation.
* Telemetry Manager leverages the [alerts API](https://prometheus.io/docs/prometheus/latest/querying/api/#alerts) to periodically retrieve information regarding active alerts.
* This method offers several advantages, including the avoidance of potentially long-running queries that can hinder reconciliation, as the evaluation is offloaded to Prometheus.
@@ -85,8 +88,10 @@ Clear communication regarding its internal nature is crucial to avoid any misuse
## Decision

### Metric Analysis

Despite its inherent complexity, running Prometheus as a standalone deployment is our preferred option. This approach offers the highest level of flexibility and ensures future-proofing, making it the most robust choice for our needs.
The following aspects have to be considered:

* To maintain security, we must make sure that only Telemetry Manager can access Prometheus; we can enforce this by deploying a NetworkPolicy (see the sketch after this list).
* For cost-effectiveness, it's a good idea for Prometheus to use ephemeral storage.
* We only need to keep data for a short period, mainly focusing on current information and possibly the last few hours of historical data.
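
A minimal sketch of such a restriction, assuming Prometheus and Telemetry Manager run in the `kyma-system` namespace with the labels shown (the names, labels, and port are assumptions, not settled decisions):

```yaml
# Illustrative NetworkPolicy: only Pods labeled as the Telemetry Manager may
# reach the self-monitoring Prometheus on its API port.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: telemetry-self-monitor-allow-manager
  namespace: kyma-system
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/name: telemetry-self-monitor # assumed Prometheus label
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app.kubernetes.io/name: telemetry-manager # assumed manager label
      ports:
        - protocol: TCP
          port: 9090
```

For the storage and retention points, the Prometheus container could mount an `emptyDir` volume and run with a short retention window, for example `--storage.tsdb.retention.time=2h`.
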
@@ -102,4 +107,4 @@ Using [Prometheus Queries with Additional Controller](#prometheus-queries-with-a
## Consequences

We can decide on how to implement [Integrating Prometheus Querying into Telemetry Manager Reconciliation Loop](#integrating-prometheus-querying-into-telemetry-manager-reconciliation-loop) by conducting a Proof of Concept (PoC). This involves running a load test with a significant number of OTel Collector replicas and various PromQL queries. The results from this testing will guide us in selecting the most suitable approach for our unique circumstances.
Future progress is tracked in https://github.com/kyma-project/telemetry-manager/issues/425.
Future progress is tracked in <https://github.com/kyma-project/telemetry-manager/issues/425>.
@@ -12,7 +12,7 @@ The Fluent Bit configuration that the Telemetry Manager currently generates from

The persistent buffer can prevent log loss only for a short period of time. Depending on the amount of generated logs, this is usually in the range of minutes to a few hours and might be too short to bridge an outage of a log backend or to detect and solve a faulty configuration.

After the in-cluster Loki backend has been removed, the pipeline isolation requirements must be reevaluated in favor of pausing the input plugin. The typical setup is a cluster with a single LogPipeline or multiple LogPipelines that ingest logs to the same backend; for example, ingesting application logs and Istio access logs into two different indexes of an [SAP Cloud Logging](../../user/integration/sap-cloud-logging) instance. With this typical setup, the benefit of pausing the `tail` input plugin exceeds the pipeline isolation requirements. Kubernetes' [logging architecture](https://kubernetes.io/docs/concepts/cluster-administration/logging/) helps by storing logs persistently on the node file-system and rotating them automatically after reaching a certain file size.
After the in-cluster Loki backend has been removed, the pipeline isolation requirements must be reevaluated in favor of pausing the input plugin. The typical setup is a cluster with a single LogPipeline or multiple LogPipelines that ingest logs to the same backend; for example, ingesting application logs and Istio access logs into two different indexes of an [SAP Cloud Logging](../../user/integration/sap-cloud-logging) instance. With this typical setup, the benefit of pausing the `tail` input plugin exceeds the pipeline isolation requirements. Kubernetes' [logging architecture](https://kubernetes.io/docs/concepts/cluster-administration/logging/) helps by storing logs persistently on the Node file-system and rotating them automatically after reaching a certain file size.
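
For context, the rotation behavior referred to above is controlled by kubelet settings; a sketch assuming it is tuned through a KubeletConfiguration (the values shown match the usual defaults but are illustrative here):

```yaml
# Illustrative KubeletConfiguration excerpt: container logs on the Node file
# system are rotated once a file exceeds containerLogMaxSize, and at most
# containerLogMaxFiles rotated files are kept per container.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
containerLogMaxSize: 10Mi
containerLogMaxFiles: 5
```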

## Decision

@@ -23,12 +23,12 @@ Fluent Bit can be configured in different ways to read container logs and ingest
3. **Dedicated log streams**: Each pipeline has its own `tail` input plugin, a list of filters, and an output plugin. This setup isolates the log processing between different pipelines. In the case of a problem, the streams can be paused individually. This setup has medium complexity.
4. **Dedicated Fluent Bit instances**: Each pipeline gets its own Fluent Bit DaemonSet. This setup also isolates the CPU and memory resources per pipeline, at the cost of a larger overhead. This setup has medium complexity.

Currently, Telemetry Manager uses option 2.
Option 1 does not fulfill our requirement to apply individual filters per pipeline. Option 4 causes an unacceptable resource overhead for our typical setup of two pipelines (application logs and access logs). Options 2 and 3 allow log filter settings per pipeline.

We consider option 3 to be the best Fluent Bit configuration for our requirements because of its lower complexity. The throughput of option 3 has been shown to be better than that of option 2 without changing Fluent Bit's CPU and memory limits.
The persistent file-system buffer is still considered useful in addition to pausing the `tail` input, because the amount of logs stored per individual Pod is significantly lower with Kubernetes' built-in log storage. The directory-size-exporter also provides better observability of log loss.
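
A minimal sketch of option 3 in Fluent Bit's YAML configuration format, assuming two LogPipelines named `app-logs` and `access-logs` (the names, paths, and backends are placeholders; the configuration actually generated by Telemetry Manager differs in detail):

```yaml
# Illustrative option 3 layout: one dedicated tail input and output per
# pipeline, so each stream can be filtered and paused independently.
pipeline:
  inputs:
    - name: tail
      tag: app-logs.*
      path: /var/log/containers/*.log
      storage.type: filesystem      # keep the per-stream persistent buffer
    - name: tail
      tag: access-logs.*
      path: /var/log/containers/*_istio-proxy-*.log
      storage.type: filesystem
  outputs:
    - name: http
      match: app-logs.*
      host: logs-backend-a.example.com   # placeholder backend
    - name: http
      match: access-logs.*
      host: logs-backend-b.example.com   # placeholder backend
```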

## Consequences

Switching from option 2 to option 3 relies on the container runtime for persistent log storage. This changes the conditions under which logs might be lost: Option 2, with its persistent buffer of 1 GB per pipeline, deletes the oldest logs when the limit is reached. For option 3, the conditions under which logs might be lost are specific to the individual pod, because log rotation happens when the log file per pod reaches a certain size. Logs are deleted in the case of pod evictions or multiple restarts. Node deletion might also be a reason for log loss. However, because the persistent buffer of the existing setup uses the host file system, this risk already exists.
Switching from option 2 to option 3 relies on the container runtime for persistent log storage. This changes the conditions under which logs might be lost: Option 2, with its persistent buffer of 1 GB per pipeline, deletes the oldest logs when the limit is reached. For option 3, the conditions under which logs might be lost are specific to the individual Pod, because log rotation happens when the log file per Pod reaches a certain size. Logs are deleted in the case of Pod evictions or multiple restarts. Node deletion might also be a reason for log loss. However, because the persistent buffer of the existing setup uses the host file system, this risk already exists.
@@ -20,9 +20,11 @@ The results of the query tests affirm that invoking Prometheus APIs won't notab
### Challenges

#### Timing of Invocation

Our current reconciliation strategy triggers either when a change occurs or every minute. While this is acceptable for periodic status updates, it may not be optimal when considering future plans to use Prometheus for autoscaling decisions.

#### Flakiness Mitigation

To ensure reliability and avoid false alerts, it's crucial to introduce a delay before signaling a problem. As suggested in [OTel Collector monitoring best practices](https://github.com/open-telemetry/opentelemetry-collector/blob/main/docs/monitoring.md):

> Use the rate of otelcol_processor_dropped_spans > 0 and otelcol_processor_dropped_metric_points > 0 to detect data loss. Depending on requirements, set up a minimal time window before alerting to avoid notifications for minor losses that fall within acceptable levels of reliability.
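
Translated into Prometheus terms, such a minimal time window maps naturally onto the `for` clause of an alerting rule; a sketch based on the metric quoted above (the threshold, window, and names are assumptions):

```yaml
groups:
  - name: telemetry-data-loss
    rules:
      - alert: TraceDataDropped
        # Fire only if spans have been dropped continuously for five minutes,
        # so short losses within acceptable reliability do not flip the status.
        expr: sum(rate(otelcol_processor_dropped_spans[5m])) > 0
        for: 5m
        labels:
          severity: warning
```
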
@@ -34,15 +36,18 @@ If we directly query Prometheus, we would need to implement such a mechanism to
Fortunately, we can leverage the Alerting feature of Prometheus to address the aforementioned challenges. The proposed workflow is as follows:

#### Rendering Alerting Rules

Telemetry Manager dynamically generates alerting rules based on the deployed pipeline configuration.
These alerting rules are then mounted into the Prometheus Pod, which is also deployed by the Telemetry Manager.
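
A sketch of how this mounting could be wired, assuming the rendered rules end up in a ConfigMap that Prometheus loads via `rule_files` (all names and paths are illustrative):

```yaml
# prometheus.yml fragment: load whatever rule files the manager rendered.
rule_files:
  - /etc/prometheus/rules/*.yaml
---
# Pod spec fragment: the rendered rules are mounted from a ConfigMap.
apiVersion: v1
kind: Pod
metadata:
  name: telemetry-self-monitor
spec:
  containers:
    - name: prometheus
      image: prom/prometheus
      volumeMounts:
        - name: alerting-rules
          mountPath: /etc/prometheus/rules
  volumes:
    - name: alerting-rules
      configMap:
        name: telemetry-alerting-rules # rendered by Telemetry Manager
```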

#### Alert Retrieval in Reconciliation

During each reconciliation iteration, the Telemetry Manager queries the [Prometheus Alerts API](https://prometheus.io/docs/prometheus/latest/querying/api/#alerts) using `github.com/prometheus/client_golang` to retrieve information about all fired alerts.
The obtained alerts are then translated into corresponding CR statuses.

#### Webhook for Immediate Reconciliation
The Telemetry Manager exposes an endpoint intended to be invoked by Prometheus whenever there is a change in the state of alerts. To facilitate this, we can configure Prometheus to treat our endpoint as an Alertmanager instance. Upon receiving a call, this endpoint initiates an immediate reconciliation of all affected resources using the https://pkg.go.dev/sigs.k8s.io/controller-runtime/pkg/builder#Builder.WatchesRawSource with https://pkg.go.dev/sigs.k8s.io/controller-runtime@v0.16.3/pkg/source#Channel.

The Telemetry Manager exposes an endpoint intended to be invoked by Prometheus whenever there is a change in the state of alerts. To facilitate this, we can configure Prometheus to treat our endpoint as an Alertmanager instance. Upon receiving a call, this endpoint initiates an immediate reconciliation of all affected resources using the <https://pkg.go.dev/sigs.k8s.io/controller-runtime/pkg/builder#Builder.WatchesRawSource> with <https://pkg.go.dev/sigs.k8s.io/controller-runtime@v0.16.3/pkg/source#Channel>.

By adopting this approach, we transfer the effort associated with expression evaluation and waiting to Prometheus.
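
A sketch of the Prometheus side of this wiring, assuming the manager exposes the endpoint through a Service named `telemetry-manager-webhook` on port 9090 (both are assumptions for illustration):

```yaml
# prometheus.yml fragment: the Telemetry Manager endpoint is registered as if
# it were an Alertmanager, so Prometheus pushes alert state changes to it.
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - telemetry-manager-webhook.kyma-system.svc:9090
```

Prometheus then posts alert notifications to the usual Alertmanager API path (typically `/api/v2/alerts`) on that target, which the manager's endpoint must accept.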

64 changes: 33 additions & 31 deletions docs/contributor/arch/004-consolidate-pipeline-statuses.md
@@ -22,41 +22,43 @@ What makes things tricky is that there may be Kyma customers who rely on the old
## Decision

Roll out the change for `LogPipeline` and `TracePipeline` in multiple phases:

1. Augment the existing custom condition structures with the missing required fields from `metav1.Condition`, like `Status` and `Message`. Make them optional. Extend the controllers' logic to populate them with reasonable values. We can roll that out without bumping the `v1alpha1` API group, because we will only add optional fields.
2. Replace custom condition structures with `metav1.Condition`. Technically, it's a breaking change, because some optional fields become required. However, we can maintain the API version because the status is automatically reconciled and should never be directly set by a user.
3. Set new conditions in the status (`GatewayHealthy`/`AgentHealthy`, `ConfigurationGenerated`) and append old conditions to the list (`Pending`, `Running`). This way, we preserve the semantics, because the user can still infer a pipeline's healthiness from the last condition in the list. Migrate to new conditions in E2E tests and `Telemetry` state propagation. Deprecate the old conditions and announce it in the Release Notes.
```yaml
status:
conditions:
- lastTransitionTime: "2024-02-01T08:26:02Z"
message: Trace gateway Deployment is ready
observedGeneration: 1
reason: DeploymentReady
status: "True"
type: GatewayHealthy
- lastTransitionTime: "2024-01-24T14:36:58Z"
message: ""
observedGeneration: 1
reason: ConfigurationGenerated
status: "True"
type: ConfigurationGenerated
- lastTransitionTime: "2024-02-01T08:24:02Z"
message: Trace gateway Deployment is not ready
observedGeneration: 1
reason: DeploymentReady
status: "False"
type: Pending
- lastTransitionTime: "2024-02-01T08:26:02Z"
message: Trace gateway Deployment is ready
observedGeneration: 1
reason: DeploymentReady
status: "True"
type: Running
```
4. After the deprecation period, remove the old conditions from the list.

```yaml
status:
conditions:
- lastTransitionTime: "2024-02-01T08:26:02Z"
message: Trace gateway Deployment is ready
observedGeneration: 1
reason: DeploymentReady
status: "True"
type: GatewayHealthy
- lastTransitionTime: "2024-01-24T14:36:58Z"
message: ""
observedGeneration: 1
reason: ConfigurationGenerated
status: "True"
type: ConfigurationGenerated
- lastTransitionTime: "2024-02-01T08:24:02Z"
message: Trace gateway Deployment is not ready
observedGeneration: 1
reason: DeploymentReady
status: "False"
type: Pending
- lastTransitionTime: "2024-02-01T08:26:02Z"
message: Trace gateway Deployment is ready
observedGeneration: 1
reason: DeploymentReady
status: "True"
type: Running
```

1. After the deprecation period, remove the old conditions from the list.

## Consequences

The status structure will be consistent across all CRDs, which will simplify our monitoring, as well as the propagation of the status from telemetry pipelines to the `Telemetry` CR.
It will also make it easier to extend statuses with new condition types.