# Effective Observability on Kubernetes

## What is observability?

Observability (sometimes referred to as o11y) is the concept of gaining an understanding into the behavior and performance of applications and systems. Observability starts by collecting system telemetry data, such as logs, metrics, and traces. More important is how that telemetry is analyzed to diagnose issues, understand the interconnectivity of dependencies, and ensure reliability. - [Honeycomb](https://www.honeycomb.io/blog/what-is-observability-key-components-best-practices)


## My Take on Observability

Observability simply means how much you know about how your IT system behaves and performs.


## Why Observability Matters

* You cannot operate a system that you cannot observe or measure.
* Empirically speaking, there is a strong correlation between the level of observability and MTTR (Mean Time To Repair).
* A good level of observability brings credibility.
* A good level of observability enables continuous delivery.


## Why is observability hard on Kubernetes?

* Kubernetes is a complex system with many sub-components and moving parts.
* The "cloud-native" ecosystem introduces even more surface areas that need to be observable.
* There are many observability solutions in the market, which makes it hard to make choices and know where to start, and very easy to veer off the beaten track.


## What we will cover

* How to set up an opinionated, production-grade observability stack on Kubernetes using Grafana Alloy.
* How to effectively instrument your application using traces, metrics and logs with the help of Grafana Alloy.
* Deeper insights into your stack using profiling and flamegraphs.

### Under the assumptions of...

* You use a Prometheus-esque TSDB solution.
* Your trace backend is OTel-compatible.


## What we will not cover

* Production-grade Mimir/Tempo deployments. They deserve a talk of their own.
* Tenant onboarding and lifecycle management on a multi-tenant observability platform. This talk very much focuses on the essence of observability itself.

## Also...

I am Jingkai He 👋

* I'm an independent software consultant based in London.
* I have set up 3 Grafana LGTM-based observability platforms for customers ranging from startups to large enterprises in the past 2 years.
* I enjoy crafting software and investigating production crime scenes.
* Currently on sabbatical.
* I've spent 860 hrs+ of my life on Total War Warhammer.


## What is Grafana Alloy?

> Alloy offers native pipelines for OTel, Prometheus, Pyroscope, Loki, and many other metrics, logs, traces, and profile tools. 

> ... Alloy is fully compatible with the OTel Collector, Prometheus Agent, and Promtail. You can use Alloy as an alternative to either of these solutions or combine it into a hybrid system of multiple collectors and agents.

From [Grafana Alloy](https://grafana.com/docs/alloy/latest/) official documentation.

**Disclaimer**: Grafana claims vendor neutrality, but I have not verified it.



### High Level Architecture

```mermaid
flowchart TD
    subgraph Applications["Kubernetes Applications"]
        App1["Application"]
    end

        
    subgraph ObservabilityNamespace["Observability Namespace"]
        direction TB
        subgraph SupplementaryServices["Supplementary Services"]
            direction TB
            KSM["Kube State Metrics\nDeployment"]
            NodeExporter["Node Exporter\nDaemonSet"]
        end
        subgraph AlloySTS["Alloy StatefulSet"]
          AlloyProm["Prom Collector (metrics)"]
          AlloyOtel["Otel Collector (traces)"]
        end
        AlloyLog["Alloy Logs Collector\nDaemonSet\n(Direct read from /var/log/pods...)"]
        AlloyProfiler["Alloy Profiler\nDaemonSet\n(Support ebpf-based profiling)"]
    end

    subgraph Storage["LGTM Stack"]
        direction LR
        Mimir["Mimir (Metrics)"]
        Tempo["Tempo (Traces)"]
        Loki["Loki (Logs)"]
        Pyroscope["Pyroscope (Continuous profiling)"]
        Grafana["Grafana\nDashboards"]
        AlertManager["Alertmanager"]
    end

    %% Application connections to collectors
    App1 -- HTTP(s) scrape --> AlloyProm
    App1 -- OTLP --> AlloyOtel
    App1 -- HTTP push-based or pprof-scraping-based profiling --> AlloyProfiler


    %% Remote connections
    AlloyProm -- Remote Write --> Mimir
    AlloyLog --> Loki
    AlloyOtel -- GRPC --> Tempo
    AlloyProfiler -- Remote Write --> Pyroscope

    %% Visualization connections
    Mimir --> Grafana
    Tempo --> Grafana
    Loki --> Grafana
    Mimir --> AlertManager
    AlertManager --> Grafana

    %% Styling
    classDef k8s fill:#326CE5,color:white
    classDef metrics fill:#E6522C,color:white
    classDef logs fill:#66B16E,color:white
    classDef traces fill:#7D4CDB,color:white
    classDef profiles fill:#FFA500,color:white
    classDef viz fill:#F8413C,color:white

    class App1,KSM,NodeExporter k8s
    class Mimir,AlloyProm metrics
    class Loki,AlloyLog logs
    class Tempo,AlloyOtel traces
    class Pyroscope,AlloyProfiler profiles
    class Grafana,AlertManager viz
```


## Alloy Pros and Cons

Pros:
* Opinionated (vs [opentelemetry-operator](https://github.com/open-telemetry/opentelemetry-operator)) and easy to set up and get started with.
* IMO a much more understandable OTel pipeline configuration.
* Top-tier integration with the LGTM stack.
* Supports eBPF-based profiling.

Cons:
* Documentation has always been playing catch-up with the product, but it's getting better.
* It comes with an HCL-esque DSL (the Alloy configuration syntax, previously known as `river`) that you will need to learn, but it's not too complicated.

## Setup

Pretty much like this: https://til.jingkaihe.com/docs/observability-from-hero-to-zero-part-i/#step-2-deploy-the-grafana-alloy-stack-into-your-k8s-cluster

Some advice:

* https://github.com/grafana/k8s-monitoring-helm/tree/main/charts/k8s-monitoring-v1 is an epic hidden gem. Use it rather than handcrafting your o11y pipeline.
* If you are unfamiliar with the LGTM stack or a vendor-specific o11y stack, try to roll out logs, metrics, traces and profiles in phases.
* Be very strict about enforcing governance labels and attributes but lenient on cardinality guardrails, so that you enable tenants to self-serve and learn at their own pace.
* Stop overthinking and just do it :)



## Customisation for Traces

Governance trace attributes:

```terraform

variable "traces_external_attributes" {
  description = "external attributes for traces"
  type        = map(string)

  default = {
    product_group = "xxx"
  }
}

// In your helm config:
      traces = {
        enabled = true
        # traces.receiver.transforms
        receiver = {
          transforms = {
            span = [
              for key, value in var.traces_external_attributes : "set(attributes[\"${key}\"], \"${value}\")"
            ]
          },
          // you can also use `filters` to drop spans that don't have the mandatory attributes
        }
      },
```


## Customisation for Metrics

Removing high-cardinality labels (in practice this can be highly problematic: dropping a label can collapse distinct series into one, which causes a lot of duplicated-metrics errors):

```terraform

variable "metrics_label_drops" {
  description = "labels to drop from metrics"
  type        = list(string)

  default = [
    "id",
    "uuid",
    "date",
    "uid",
    "container_id",
    "application_id",
    "created_at",
    "client_id",
    "pod_ip",
    "image_id",
    "resource_id",
    "account_id",
  ]
}

// In your helm config:
      metrics = {
        cost = {
          enabled = false
        },
        extraMetricRelabelingRules = <<EOT
        rule {
          action = "labeldrop"
          regex = "${join("|", var.metrics_label_drops)}"
        }
        EOT
      },
```
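To see why `labeldrop` can cause duplicates, here is a minimal sketch in plain Go (standard library only) that counts how many series in a scrape end up with an identical label set once some labels are removed. The function names and the sample label values are made up for illustration and are not part of Alloy or Prometheus.

```go
package main

import (
	"sort"
	"strings"
)

// seriesKey builds a canonical identity for a series from its labels,
// ignoring every label listed in drop.
func seriesKey(labels map[string]string, drop map[string]bool) string {
	keys := make([]string, 0, len(labels))
	for k := range labels {
		if !drop[k] {
			keys = append(keys, k)
		}
	}
	sort.Strings(keys)

	var b strings.Builder
	for _, k := range keys {
		b.WriteString(k + "=" + labels[k] + ";")
	}
	return b.String()
}

// CountCollisions reports how many series would become duplicates of another
// series in the same scrape once the given labels are dropped.
func CountCollisions(series []map[string]string, dropped []string) int {
	drop := make(map[string]bool, len(dropped))
	for _, name := range dropped {
		drop[name] = true
	}

	seen := make(map[string]bool, len(series))
	collisions := 0
	for _, labels := range series {
		key := seriesKey(labels, drop)
		if seen[key] {
			collisions++
		}
		seen[key] = true
	}
	return collisions
}
```

If two series differ only by `client_id`, dropping `client_id` leaves them with the same identity, which is the "dimension collapsing" mentioned above. Running a check like this against a sample scrape before adding a label to the drop list is one way to catch the problem early.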


## Customisation for Profiling


```terraform
      profiles = {
        enabled = true
        java = {
          enabled = false
        }
        pprof = {
          // We are not interested in profiling istio!
          extraRelabelingRules = <<-EOT
rule {
  action = "drop"
  source_labels = ["__meta_kubernetes_pod_container_name"]
  regex = "(istio-init|istio-proxy)" 
}
rule {
  action = "labeldrop"
  regex = "${join("|", var.profiles_label_drops)}"
}
          EOT
        }
        ebpf = {
          enabled    = true
          namespaces = ["namespace-a", "namespace-b"] // Useful for pilot usage of eBPF-based profiling
        }
      },
```

## Effective application logging

### Why are logs useful?

Logs tell you what is going on at the transaction level.
IMO it's the cheapest and easiest way to get observability.

### Some of the good practices

* Just log to stdout/stderr and let the log-agent DaemonSet take care of log collection.
* Do structured logging:
  * Human readable.
  * Log engine friendly (ideally use logfmt or JSON format).
  * Treat logs as slice and dice-able events.

### Structured logging

```go
// good because:
// 1. Relatively human readable: `message="User signed up" user_id=123 username="jingkai"`
// 2. Highly searchable and slice and dice-able, as easy as `{service="some-service"} | logfmt | user_id="123"`
logger.G(ctx).WithField("user_id", 123).WithField("username", "jingkai").Info("User signed up")

// bad because:
// 1. Reads like human language but is extremely machine unfriendly when it comes to indexing and searching.
// 2. You end up with pattern matching and regex to search for things.
logger.G(ctx).Infof("Username %s with id %d has signed up", "jingkai", 123)
```
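If you do not have a logging wrapper like the `logger` package above, the standard library can produce the same kind of key=value output. The sketch below uses `log/slog` (Go 1.21+) with its text handler; it writes into a buffer and drops the timestamp purely so the output is easy to inspect. The function name is made up for illustration, and a real service would write to `os.Stdout` and keep the timestamp.

```go
package main

import (
	"bytes"
	"log/slog"
)

// SignupLogLine logs a sign-up event as key=value pairs and returns the
// rendered line so the output format is easy to inspect.
func SignupLogLine(userID int, username string) string {
	var buf bytes.Buffer

	handler := slog.NewTextHandler(&buf, &slog.HandlerOptions{
		// Drop the timestamp to keep the example output stable.
		// Keep it in real services.
		ReplaceAttr: func(groups []string, a slog.Attr) slog.Attr {
			if len(groups) == 0 && a.Key == slog.TimeKey {
				return slog.Attr{}
			}
			return a
		},
	})

	logger := slog.New(handler)
	// Every field is a separate attribute, not part of the message string.
	logger.Info("User signed up", "user_id", userID, "username", username)
	return buf.String()
}
```

`SignupLogLine(123, "jingkai")` renders `level=INFO msg="User signed up" user_id=123 username=jingkai`, which a `| logfmt` stage should be able to parse into fields.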


### Caveats

* `json` format logs can be a killer for your application performance when you have an excessive amount of logs, because `json.Marshal` burns CPU cycles.
* Writing to `stdout/stderr` can also be a silent killer for your application performance, and it is extremely hard to troubleshoot because the cost is off-CPU in nature (blocking writes and dirty page write-back).
* Overall, just avoid excessive logging.


## Effective metrics instrumentation

### Why are metrics useful?

Great for measuring QoS and signalling potential issues.


### What can we measure with metrics?

* Latency
* Error rate
* Throughput
* Resource utilisation (CPU, memory, disk, network)
* Queue length
* ...


## Categorise metrics

### RED method

What I use mostly these days since it reflects user experience, thus more SLI/SLO oriented.

This is what we will be focusing on in this talk.

* Rate: Requests per second
* Errors: Failed requests per second
* Duration: Latency


### USE method

Useful if you are part of a computing team that manages the underlying infrastructure, but IMO it's becoming less useful if you are using commoditised Kubernetes services.

The majority of the telemetry data is scraped by Alloy from the node-exporter and cAdvisor metrics endpoints.

* Utilisation: Resource utilisation (CPU, memory, disk, network)
* Saturation: How full is your system (e.g. queue length, CPU, memory saturation)
* Errors: Number of error events (e.g. page fault, soft kernel panic, etc)

## How to collect RED metrics: The low-hanging fruit

Scrape metrics from your edge ingress controller/API gateway, NOW!

The latency, throughput and error rate from the edge translate directly to user experience, which is invaluable from the perspective of measuring SLI/SLO.

```terraform
  podAnnotations = {
    "k8s.grafana.com/scrape"           = "true"
    "k8s.grafana.com/job"              = "integrations/ingress-nginx"
    "k8s.grafana.com/metrics.path"     = "/metrics"
    "k8s.grafana.com/metrics.portName" = "metrics"
    "k8s.grafana.com/metrics.scheme"   = "http"
  }
```


## How to collect RED metrics from your application?

```go
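package metrics

import (
	"context"
	"errors"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/collectors"
	"github.com/prometheus/client_golang/prometheus/promhttp"
	// logger: the application's own context-aware logging package (import path omitted here)
)

var (
	Registry          = prometheus.NewRegistry()
	HttpRequestsTotal = prometheus.NewCounterVec(
		prometheus.CounterOpts{
			Name: "http_requests_total",
			Help: "Total number of HTTP requests",
		},
		[]string{"code", "method", "path"},
	)
	HttpRequestDuration = prometheus.NewHistogramVec(
		prometheus.HistogramOpts{
			Name:    "http_request_duration_seconds",
			Help:    "Duration of HTTP requests",
			Buckets: prometheus.DefBuckets,
		},
		[]string{"code", "method", "path"},
	)
)

// Init registers the collectors and serves /metrics on port 8081.
func Init(ctx context.Context) {
	Registry.MustRegister(
		// Go runtime and process metrics, alongside the RED metrics above.
		collectors.NewGoCollector(),
		collectors.NewProcessCollector(collectors.ProcessCollectorOpts{}),
		HttpRequestsTotal,
		HttpRequestDuration,
	)
	http.Handle("/metrics", promhttp.HandlerFor(
		Registry,
		promhttp.HandlerOpts{
			EnableOpenMetrics: true,
		},
	))

	logger.G(ctx).Info("Starting metrics server on :8081")
	go func() {
		// Surface listener failures (e.g. port already in use) instead of dropping them.
		if err := http.ListenAndServe("0.0.0.0:8081", nil); err != nil && !errors.Is(err, http.ErrServerClosed) {
			logger.G(ctx).WithField("error", err).Error("Metrics server stopped")
		}
	}()
}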
```

## ... and you can instrument your middleware with

```go
package metrics

import (
	"strconv"
	"time"

	"github.com/gin-gonic/gin"
)

// Observe records the request count and duration for a route.
// Pass the route template (not the raw URL) as path to keep label cardinality low.
func Observe(path string) gin.HandlerFunc {
	return func(c *gin.Context) {
		start := time.Now()

		defer func() {
			code := strconv.Itoa(c.Writer.Status())
			HttpRequestsTotal.WithLabelValues(code, c.Request.Method, path).Inc()
			HttpRequestDuration.WithLabelValues(code, c.Request.Method, path).Observe(time.Since(start).Seconds())
		}()

		c.Next()
	}
}

// How it is used as a per-route middleware (in the package that sets up the router):
// r.GET("/api/user/:id", metrics.Observe("/api/user/:id"), handler)
```
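The same idea works without gin. The sketch below uses only the standard library: a small `http.ResponseWriter` wrapper captures the status code, and a toy in-memory `REDStats` type stands in for the Prometheus counter and histogram. All type and function names here are hypothetical and exist only to show the mechanism.

```go
package main

import (
	"net/http"
	"net/http/httptest"
	"strconv"
	"sync"
	"time"
)

// statusRecorder remembers the status code written by the wrapped handler.
type statusRecorder struct {
	http.ResponseWriter
	status int
}

func (r *statusRecorder) WriteHeader(code int) {
	r.status = code
	r.ResponseWriter.WriteHeader(code)
}

// REDStats is a toy in-memory stand-in for a counter and a histogram.
type REDStats struct {
	mu       sync.Mutex
	Requests map[string]int           // key: "code method route"
	Duration map[string]time.Duration // summed latency per key
}

func NewREDStats() *REDStats {
	return &REDStats{Requests: map[string]int{}, Duration: map[string]time.Duration{}}
}

// Observe wraps next and records rate, errors (via the code) and duration.
func (s *REDStats) Observe(route string, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		start := time.Now()
		// Handlers that never call WriteHeader implicitly return 200.
		rec := &statusRecorder{ResponseWriter: w, status: http.StatusOK}

		next.ServeHTTP(rec, r)

		key := strconv.Itoa(rec.status) + " " + r.Method + " " + route
		s.mu.Lock()
		s.Requests[key]++
		s.Duration[key] += time.Since(start)
		s.mu.Unlock()
	})
}

// SimulateNotFound sends one request through the middleware and returns the stats.
func SimulateNotFound() *REDStats {
	stats := NewREDStats()
	h := stats.Observe("/api/user/:id", http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		http.Error(w, "no such user", http.StatusNotFound)
	}))
	h.ServeHTTP(httptest.NewRecorder(), httptest.NewRequest(http.MethodGet, "/api/user/42", nil))
	return stats
}
```

Note that the key uses the route template `/api/user/:id`, not the requested `/api/user/42`, for the same cardinality reason as in the gin middleware.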

## Some useful metrics queries

* `sum by (method, path) (rate(http_requests_total{cluster="$cluster", job="$job"}[$__rate_interval]))` - Rate (throughput) of the given service, per method and path.
* `histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket{cluster="$cluster", job="$job"}[$__rate_interval])) by (le))` - Latency at the 99th percentile.
* `sum (rate(http_requests_total{cluster="$cluster", job="$job", code=~"5.."}[$__rate_interval])) / sum(rate(http_requests_total{cluster="$cluster", job="$job"}[$__rate_interval]))` - Service error rate, as a fraction of all requests.
* `(sum(rate(http_request_duration_seconds_bucket{cluster="$cluster", job="$job", le="0.3"}[$__rate_interval])) + sum(rate(http_request_duration_seconds_bucket{cluster="$cluster", job="$job", le="0.6"}[$__rate_interval]))) / 2 / sum(rate(http_request_duration_seconds_count{cluster="$cluster", job="$job"}[$__rate_interval]))` - [Apdex score](https://en.wikipedia.org/wiki/Apdex), here with a 0.3s satisfied threshold and a 0.6s tolerated threshold. The `le` values must match bucket boundaries that actually exist: `prometheus.DefBuckets` has no 0.3 or 0.6 bucket, so either configure custom `Buckets` or use boundaries you do have (e.g. `0.25` and `0.5`).
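The Apdex query adds two buckets and halves the sum, which can look odd at first. Histogram buckets are cumulative, so the tolerated bucket already contains the satisfied requests: `satisfied + tolerating/2` is the same as `(satisfiedCum + toleratedCum) / 2`. A minimal sketch of that arithmetic, with a made-up function name and made-up numbers:

```go
package main

// Apdex computes the score from cumulative histogram bucket counts.
// satisfiedCum counts requests with latency <= the satisfied threshold,
// toleratedCum counts requests with latency <= the tolerated threshold
// (so it already includes the satisfied ones), and total counts all requests.
func Apdex(satisfiedCum, toleratedCum, total float64) float64 {
	if total == 0 {
		return 0 // no traffic: avoid dividing by zero in this sketch
	}
	return (satisfiedCum + toleratedCum) / 2 / total
}
```

For example, with 100 requests of which 60 are within the satisfied threshold and 80 within the tolerated one, there are 20 merely tolerating requests, and the score is `(60 + 20/2) / 100`, i.e. `(60 + 80) / 2 / 100`.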



## How to collect the metrics (the "legacy" way)

Use the `PodMonitor` CRD to collect metrics from your application.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: my-app
  namespace: the-namespace
spec:
  selector:
    matchLabels:
      app: my-app
  podMetricsEndpoints:
    - port: metrics # this is a NAME instead of a NUMBER!!!
```

```yaml
containers:
- name: my-app
  ports:
  - name: metrics
    containerPort: 8081
    protocol: TCP
```

## How to collect the metrics (the magic way)

Use the undocumented `k8s.grafana.com/job` annotation (together with its sibling annotations) to collect metrics from your application.

```yaml
annotations:
  k8s.grafana.com/job: my-app
  k8s.grafana.com/scrape: "true"
  k8s.grafana.com/metrics.path: /metrics
  k8s.grafana.com/metrics.portName: metrics
  k8s.grafana.com/metrics.scheme: http
  k8s.grafana.com/metrics.scrapeInterval: 60s
```

## Effective tracing instrumentation

### Why is tracing useful?

* It provides high-cardinality attributes and context that travel with the data flow.
* You can find the root cause of an issue by slicing and dicing the high-cardinality attributes.
* It allows you to correlate issues across services.


## How to trace - The easy way

Use an auto-instrumentation library or, in the case of Go, auto-instrument the binary using a sidecar container as shown below.

This is useful to get quick wins and stakeholder buy-in without too much up-front investment, but from my experience it does not correlate between services at all...

```terraform
      containers = [
        {
          name  = "go-auto-otel"
          image = "otel/autoinstrumentation-go:v0.13.0-alpha"
          env = [
            {
              name  = "OTEL_EXPORTER_OTLP_ENDPOINT"
              value = "http://grafana-k8s-monitoring-grafana-agent.observability.svc.cluster.local:4318"
            },
            {
              name = "OTEL_SERVICE_NAME"
              value = "my-app"
            },
            {
              name = "OTEL_GO_AUTO_TARGET_EXE"
              value = "/app/my-app"
            }
          ]
          securityContext = {
            runAsUser  = 0
            privileged = true
          }
        }
      ]
```
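Correlation between services depends on each service passing the trace context on to the next one, usually in the W3C Trace Context `traceparent` header (`version-traceid-parentid-flags`). The sketch below parses that header and renders the one to send downstream using only the standard library. It is an illustration of the mechanism with hypothetical names, not a replacement for the OTel propagators, which do this for you when calls are instrumented.

```go
package main

import (
	"encoding/hex"
	"errors"
	"strings"
)

// TraceParent holds the fields of a version 00 `traceparent` header.
type TraceParent struct {
	TraceID  string // 32 lowercase hex characters, shared by the whole trace
	ParentID string // 16 lowercase hex characters, the caller's span ID
	Sampled  bool   // lowest bit of the trace-flags field
}

var errBadTraceParent = errors.New("malformed or unsupported traceparent header")

// isLowerHex reports whether s is exactly n lowercase hex characters.
func isLowerHex(s string, n int) bool {
	if len(s) != n || s != strings.ToLower(s) {
		return false
	}
	_, err := hex.DecodeString(s)
	return err == nil
}

// ParseTraceParent validates and splits an incoming header value.
func ParseTraceParent(header string) (TraceParent, error) {
	parts := strings.Split(strings.TrimSpace(header), "-")
	if len(parts) != 4 || parts[0] != "00" {
		return TraceParent{}, errBadTraceParent
	}
	traceID, parentID, flags := parts[1], parts[2], parts[3]
	if !isLowerHex(traceID, 32) || !isLowerHex(parentID, 16) || !isLowerHex(flags, 2) {
		return TraceParent{}, errBadTraceParent
	}
	// All-zero IDs are invalid.
	if strings.Trim(traceID, "0") == "" || strings.Trim(parentID, "0") == "" {
		return TraceParent{}, errBadTraceParent
	}
	flagByte, _ := hex.DecodeString(flags) // already validated above
	return TraceParent{TraceID: traceID, ParentID: parentID, Sampled: flagByte[0]&0x01 == 1}, nil
}

// Outgoing renders the header to send downstream: the same trace ID, with the
// current service's span ID as the new parent.
func (tp TraceParent) Outgoing(spanID string) string {
	flags := "00"
	if tp.Sampled {
		flags = "01"
	}
	return "00-" + tp.TraceID + "-" + spanID + "-" + flags
}
```

If any hop drops this header, the downstream service starts a fresh trace ID and the spans can no longer be stitched together.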

## How to trace - The artisan way

Use the OTel SDK to manually instrument your application - https://til.jingkaihe.com/docs/observability-from-zero-to-hero-part-iv/#setup-otel-sdk

To push traces to Alloy, specify the following environment variables in your container spec:

```yaml
- name: OTEL_EXPORTER_OTLP_INSECURE
  value: "true"
- name: OTEL_EXPORTER_OTLP_ENDPOINT
  # 4317 is the conventional OTLP/gRPC port (4318 is OTLP/HTTP)
  value: http://grafana-k8s-monitoring-grafana-agent.observability.svc.cluster.local:4317
- name: OTEL_EXPORTER_OTLP_PROTOCOL
  value: grpc
- name: OTEL_SERVICE_NAME
  valueFrom:
    fieldRef:
      apiVersion: v1
      fieldPath: metadata.labels['app']
```


## Caveats

* To make OTel useful you need to use third-party libraries that support it:
  * [otelhttp](https://github.com/open-telemetry/opentelemetry-go-contrib/tree/main/instrumentation/net/http/otelhttp) over the standard `net/http` library.
  * [gorm](https://github.com/go-gorm/opentelemetry) over `mongo-driver`.
  * [ginotel](https://github.com/itsubaki/ginotel).
  * [otelsql](https://github.com/XSAM/otelsql) over `database/sql`.
  * ...
* Also, in Go always remember to propagate the `context.Context` (which carries the span context) all the way down your stack.
* At scale you probably want to sample traces to reduce cost and avoid overwhelming the backend.
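To make the sampling point concrete, here is a minimal sketch of a deterministic, ratio-based head sampler in plain Go. It is similar in spirit to the ratio-based samplers that OTel SDKs ship (e.g. `TraceIDRatioBased` in the Go SDK) but it is not their exact algorithm, and the function name is made up. In practice you would configure the SDK's sampler or sample in the collector rather than write your own.

```go
package main

import (
	"encoding/binary"
	"encoding/hex"
)

// ShouldSample makes a keep/drop decision from the trace ID alone, so every
// service that sees the same trace ID with the same ratio reaches the same
// decision and traces are kept or dropped as a whole.
func ShouldSample(traceIDHex string, ratio float64) bool {
	if ratio >= 1 {
		return true
	}
	if ratio <= 0 {
		return false
	}

	raw, err := hex.DecodeString(traceIDHex)
	if err != nil || len(raw) != 16 {
		return false // not a valid 128-bit trace ID
	}

	// Treat the last 8 bytes as a number in [0, 2^63) and keep the trace
	// if it falls below ratio * 2^63.
	x := binary.BigEndian.Uint64(raw[8:16]) >> 1
	threshold := uint64(ratio * float64(1<<63))
	return x < threshold
}
```

Because the decision is a pure function of the trace ID, a 10% ratio keeps roughly 10% of traces, assuming trace IDs are randomly generated, without the services having to coordinate.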





## Effective Continuous Profiling using Pyroscope

### Why is continuous profiling useful?

* It is the way to fully understand a performance issue in terms of CPU, memory, threads, blocking, mutex contention etc. at the level of the application runtime.
* It eliminates the need for eyeballing and guessing where the performance issue comes from.
* It eliminates the painful experience of running `git clone https://github.com/brendangregg/FlameGraph` inside a privileged container and running arbitrary profiling scripts.

### How does continuous profiling work?

* Instrumented application runtimes are sampled at a low frequency, with the samples sent to the backend.
* Profiles are analysed (typically) in the form of flamegraphs.




## Collect profiles using Alloy - The easy way

* The easiest way to get started with continuous profiling is to use eBPF-based profiling.
* From my experience it profiles Go and Python applications out of the box.

### Caveats

* Only CPU profiles are supported.

## Other ways to collect profiles

* Push-based profiling - https://grafana.com/docs/pyroscope/latest/configure-client/language-sdks/go_push/
* pprof-scraping-based profiling - https://grafana.com/docs/pyroscope/latest/configure-client/grafana-alloy/go_pull/#expose-pprof-endpoints It takes about three lines to set up.
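For the pprof-scraping route, the application only has to expose the standard `net/http/pprof` handlers. A minimal sketch, assuming you want them on a dedicated mux and port rather than on the public one: the function names and the sampling rates are illustrative, and the container port you serve this on would need to be named `pprof` to match the `port_name` annotations.

```go
package main

import (
	"net/http"
	"net/http/httptest"
	"net/http/pprof"
	"runtime"
)

// NewPprofMux returns a dedicated mux exposing the standard pprof endpoints.
func NewPprofMux() *http.ServeMux {
	// Block and mutex profiles stay empty unless sampling is switched on.
	// The rates below are illustrative; tune them for your workload.
	runtime.SetBlockProfileRate(10000) // sample about one blocking event per 10µs spent blocked
	runtime.SetMutexProfileFraction(5) // sample about 1 in 5 mutex contention events

	mux := http.NewServeMux()
	// Index also serves the named profiles: heap, goroutine, block, mutex, ...
	mux.HandleFunc("/debug/pprof/", pprof.Index)
	// The CPU profile has its own handler.
	mux.HandleFunc("/debug/pprof/profile", pprof.Profile)
	return mux
}

// ServePprof blocks while serving the pprof endpoints on addr, e.g. ":6060"
// (an assumed port; use whatever your container spec exposes).
func ServePprof(addr string) error {
	return http.ListenAndServe(addr, NewPprofMux())
}

// ProbeStatus requests a pprof path in-process and returns the HTTP status,
// which is handy for a quick smoke test.
func ProbeStatus(path string) int {
	rec := httptest.NewRecorder()
	NewPprofMux().ServeHTTP(rec, httptest.NewRequest(http.MethodGet, path, nil))
	return rec.Code
}
```

In a service you would typically start it with `go ServePprof(":6060")` next to the main server, so profiling endpoints are never reachable through the ingress.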

Configuration for pprof-scraping-based profiling on Kubernetes is very laborious. That said, it can easily be streamlined via Kustomize or sidecar injection.

```terraform
    spec = {
      annotations = {
        "profiles.grafana.com/goroutine.scrape"    = "true"
        "profiles.grafana.com/goroutine.port_name" = "pprof"
        "profiles.grafana.com/goroutine.scheme"    = "http"
        "profiles.grafana.com/goroutine.path"      = "/debug/pprof/goroutine"
        "profiles.grafana.com/block.scrape"        = "true"
        "profiles.grafana.com/block.port_name"     = "pprof"
        "profiles.grafana.com/block.scheme"        = "http"
        "profiles.grafana.com/block.path"          = "/debug/pprof/block"
        "profiles.grafana.com/mutex.scrape"        = "true"
        "profiles.grafana.com/mutex.port_name"     = "pprof"
        "profiles.grafana.com/mutex.scheme"        = "http"
        "profiles.grafana.com/mutex.path"          = "/debug/pprof/mutex"
        "profiles.grafana.com/fgprof.scrape"       = "false"
        "profiles.grafana.com/memory.scrape"       = "true"
        "profiles.grafana.com/memory.port_name"    = "pprof"
        "profiles.grafana.com/memory.scheme"       = "http"
        "profiles.grafana.com/memory.path"         = "/debug/pprof/heap"
        "profiles.grafana.com/cpu.scrape"          = "true"
        "profiles.grafana.com/cpu.port_name"       = "pprof"
        "profiles.grafana.com/cpu.scheme"          = "http"
        "profiles.grafana.com/cpu.path"            = "/debug/pprof/profile"
      }
    }
```

## Demo Time