[nr-k8s-otel-collector] Updates on README and chart values #1363

Merged · 18 commits · May 16, 2024
6 changes: 4 additions & 2 deletions charts/nr-k8s-otel-collector/Chart.yaml
@@ -1,6 +1,6 @@
apiVersion: v2
name: nr-k8s-otel-collector
- description: A Helm chart to deploy OpenTelemetry Collector as an agent.
+ description: A Helm chart to monitor a Kubernetes Cluster using an OpenTelemetry Collector.
home: https://github.com/newrelic/helm-charts
icon: https://newrelic.com/assets/newrelic/source/NewRelic-logo-square.svg

@@ -17,7 +17,7 @@ type: application
# This is the chart version. This version number should be incremented each time you make changes
# to the chart and its templates, including the app version.
# Versions are expected to follow Semantic Versioning (https://semver.org/)
- version: 0.1.1
+ version: 0.1.2

dependencies:
- name: common-library
@@ -46,3 +46,5 @@ keywords:
- infrastructure
- newrelic
- monitoring
- opentelemetry
- kubernetes
103 changes: 71 additions & 32 deletions charts/nr-k8s-otel-collector/README.md
@@ -1,75 +1,114 @@
- # Installation:
[![Community Plus header](https://github.com/newrelic/opensource-website/raw/master/src/images/categories/Community_Plus.png)](https://opensource.newrelic.com/oss-category/#community-plus)

- ### 1. Download source code
- ```
- git clone https://github.com/newrelic/helm-charts.git
- ```
# nr-k8s-otel-collector

A Helm chart to monitor a Kubernetes Cluster using an OpenTelemetry Collector.

# Helm installation

- ### 2. Update config [here](https://github.com/newrelic/helm-charts/tree/master/charts/nr-k8s-otel-collector/values.yaml#L20-L24) to add a cluster name, and New Relic Ingest - License key
+ Download and update the config [here](https://github.com/newrelic/helm-charts/tree/master/charts/nr-k8s-otel-collector/values.yaml#L20-L24) to add a cluster name and your New Relic Ingest - License key.

Example:
```
licenseKey: "EXAMPLEINGESTLICENSEKEY345878592NRALL"
cluster: "SampleApp"
cluster: "SampleApp"
```

- ### 3. From the root directory of this chart, run:
- ```
- helm install nr-k8s-otel-collector nr-k8s-otel-collector -n newrelic --create-namespace
- ```
You can install this chart directly from this Helm repository:

```shell
helm repo add newrelic https://helm-charts.newrelic.com
helm upgrade nr-k8s-otel-collector newrelic/nr-k8s-otel-collector -f your-custom-values.yaml -n newrelic --create-namespace --install
```
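For reference, a minimal `your-custom-values.yaml` for the command above could contain just the two settings from the example earlier; this is a sketch, not the full list of options (see the Values section below):

```yaml
# your-custom-values.yaml — minimal sketch with only the basic settings
licenseKey: "EXAMPLEINGESTLICENSEKEY345878592NRALL"
cluster: "SampleApp"
```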

## Confirm installation
### Watch pods spin up:

```
- kubectl get pods -A --watch
+ kubectl get pods -n newrelic --watch
```
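Illustrative output of the watch (the pod names, suffixes, and counts below are hypothetical and will vary by cluster and chart version):

```
NAME                                             READY   STATUS    RESTARTS   AGE
nr-k8s-otel-collector-agent-x7k2p                1/1     Running   0          35s
nr-k8s-otel-collector-kube-state-metrics-abc12   1/1     Running   0          35s
```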

### Check logs of opentelemetry pod that spins up:
```
kubectl logs <otel-pod-name> -n newrelic
```
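If you don't know the pod name, you can list the collector pods first; the label selector below is an assumption about the chart's generated labels, so fall back to a plain `kubectl get pods -n newrelic` if it does not match:

```shell
# List collector pods (the label value is an assumption; adjust if needed)
kubectl get pods -n newrelic -l app.kubernetes.io/name=nr-k8s-otel-collector
# Follow the logs of a specific pod
kubectl logs <otel-pod-name> -n newrelic --tail=100 -f
```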

### Confirm data coming through in New Relic
You should see data reporting into New Relic within a couple of seconds in the `InfrastructureEvent`, `Metric`, and `Log` tables.
```
- FROM Metric SELECT *
+ FROM Metric SELECT * WHERE k8s.cluster.name='<CLUSTER_NAME>'
```
```
- FROM InfrastructureEvent SELECT *
+ FROM InfrastructureEvent SELECT * WHERE k8s.cluster.name='<CLUSTER_NAME>'
```
```
- FROM Log SELECT *
+ FROM Log SELECT * WHERE k8s.cluster.name='<CLUSTER_NAME>'
```
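To verify that the data is fresh rather than historical, you can narrow any of these queries with a time window and an aggregate; standard NRQL, for example:

```
FROM Metric SELECT count(*) WHERE k8s.cluster.name='<CLUSTER_NAME>' SINCE 5 minutes ago
```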
## Uninstall

Run the following command.

- ## Development notes
- ### Iterating on otel config:
- 1. Make changes to the [opentelemetry configuration](https://github.com/newrelic/helm-charts/tree/master/charts/nr-k8s-otel-collector/templates/configmap.yaml#L6-L485)
- 2. Upgrade the release:
```
- helm upgrade nr-k8s-otel-collector nr-k8s-otel-collector -n newrelic
+ helm uninstall nr-k8s-otel-collector -n newrelic
```

## Values managed globally

This chart implements [New Relic's common Helm library](https://github.com/newrelic/helm-charts/tree/master/library/common-library), which
means that it honors a wide range of defaults and globals common to most New Relic Helm charts.

Options that can be defined globally include `affinity`, `nodeSelector`, `tolerations` and others. The full list can be found in the
[user's guide of the common library](https://github.com/newrelic/helm-charts/blob/master/library/common-library/README.md).
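As a sketch of how these globals interact with this chart's own values, the same settings could be defined once under `global` and shared by every New Relic chart in the release (key names per the common-library guide linked above):

```yaml
# Sketch: values shared through the common library's `global` block
global:
  cluster: "SampleApp"
  licenseKey: "EXAMPLEINGESTLICENSEKEY345878592NRALL"
  tolerations:
    - key: "node-role.kubernetes.io/control-plane"
      operator: "Exists"
      effect: "NoSchedule"
```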

## Values

| Key | Type | Default | Description |
|-----|------|---------|-------------|
| affinity | object | `{}` | Sets pod/node affinities |
| cluster | string | `""` | Name of the Kubernetes cluster monitored. Mandatory. Can be configured also with `global.cluster` |
| customSecretLicenseKey | string | `""` | In case you don't want to have the license key in your values, this allows you to point to the key inside the secret where the license key is located. Can be configured also with `global.customSecretLicenseKey` |
| customSecretName | string | `""` | In case you don't want to have the license key in your values, this allows you to point to a user-created secret to get the key from there. Can be configured also with `global.customSecretName` |
| image.pullPolicy | string | `"IfNotPresent"` | The pull policy is defaulted to IfNotPresent, which skips pulling an image if it already exists. If pullPolicy is defined without a specific value, it is also set to Always. |
| image.repository | string | `"otel/opentelemetry-collector-contrib"` | OTel collector image to be deployed. You can use your own collector as long as it meets the requirements mentioned below. |
| image.tag | string | `"0.91.0"` | Overrides the image tag whose default is the chart appVersion. |
| kube-state-metrics.enabled | bool | `true` | Install the [`kube-state-metrics` chart](https://github.com/prometheus-community/helm-charts/tree/main/charts/kube-state-metrics) from the stable helm charts repository. This is mandatory if `infrastructure.enabled` is set to `true` and the user does not provide their own instance of KSM version >=1.8 and <=2.0. Note, kube-state-metrics v2+ disables labels/annotations metrics by default. You can enable the target labels/annotations metrics to be monitored by using the metricLabelsAllowlist/metricAnnotationsAllowList options described [here](https://github.com/prometheus-community/helm-charts/blob/159cd8e4fb89b8b107dcc100287504bb91bf30e0/charts/kube-state-metrics/values.yaml#L274) in your Kubernetes clusters. |
| kube-state-metrics.prometheusScrape | bool | `false` | Disable prometheus from auto-discovering KSM and potentially scraping duplicated data |
| licenseKey | string | `""` | Sets the license key to use. Can be configured also with `global.licenseKey` |
| nodeSelector | object | `{}` | Sets pod's node selector. Can be configured also with `global.nodeSelector` |
| nrStaging | bool | `false` | Send the metrics to the staging backend. Requires a valid staging license key. Can be configured also with `global.nrStaging` |
| podAnnotations | object | `{}` | Annotations to be added to each pod created by the chart |
| podSecurityContext | object | `{}` | Sets security context (at pod level). Can be configured also with `global.podSecurityContext` |
| resources | object | `{}` | The default set of resources assigned to the pods is shown below: |
| securityContext | object | `{"privileged":true}` | Sets security context (at container level). Can be configured also with `global.podSecurityContext` |
| tolerations | list | `[]` | Sets pod's tolerations to node taints. Can be configured also with `global.tolerations` |
| verboseLog | bool | `false` | Sets the debug logs to this integration or all integrations if it is set globally. Can be configured also with `global.verboseLog` |
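As an illustration of the `customSecretName`/`customSecretLicenseKey` pair, you can keep the license key out of your values file by creating the secret yourself and pointing the chart at it; the secret name and key below are hypothetical:

```shell
# Hypothetical secret holding the ingest license key
kubectl create secret generic newrelic-license -n newrelic \
  --from-literal=license=<YOUR_INGEST_LICENSE_KEY>
```

```yaml
# values: reference the secret instead of setting licenseKey directly
customSecretName: "newrelic-license"
customSecretLicenseKey: "license"
```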

## Common Errors

### Exporting Errors

Timeout errors while starting up the collector are expected as the collector attempts to establish a connection with NR.
These timeout errors can also pop up over time while the collector is running, but they are transient and expected to self-resolve. Further improvements are underway to reduce the number of timeout errors we're seeing from the NR1 endpoint.

```
info exporterhelper/retry_sender.go:154 Exporting failed. Will retry the request after interval. {"kind": "exporter", "data_type": "metrics", "name": "otlphttp/newrelic", "error": "failed to make an HTTP request: Post \"https://staging-otlp.nr-data.net/v1/metrics\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)", "interval": "5.445779213s"}
```
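If the retry noise is a problem, the `otlphttp` exporter supports the collector's standard `retry_on_failure` settings; a sketch of how they could be tuned, assuming you adjust the chart's rendered collector configuration (the configmap referenced earlier):

```yaml
# Sketch: standard exporterhelper retry settings for the otlphttp exporter
exporters:
  otlphttp/newrelic:
    retry_on_failure:
      enabled: true
      initial_interval: 5s   # wait longer before the first retry
      max_interval: 30s      # cap the backoff between attempts
      max_elapsed_time: 300s # give up on a batch after 5 minutes
```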

### No such file or directory

Sometimes we see "failed to open file" errors from the `filelog` and `hostmetrics` receivers because of a race condition: the pod or process was ephemeral (e.g. a cronjob or a short-lived `sleep`) and was terminated before the collector could read the file, so the file or directory no longer exists.

`filelog` error:
```
Failed to open file {"kind": "receiver", "name": "filelog", "data_type": "logs", "component": "fileconsumer", "error": "open /var/log/pods/<podname>/<containername>/0.log: no such file or directory"}
```
`hostmetrics` error:
```
Error scraping metrics {"kind": "receiver", "name": "hostmetrics", "data_type": "metrics", "error": "error reading <metric> for process \"<process>\" (pid <PID>): open /hostfs/proc/<PID>/stat: no such file or directory; error reading <metric> info for process \"<process>\" (pid <PID>): open /hostfs/proc/<PID>/<metric>: no such file or directory", "scraper": "process"}
```
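If these transient errors are too noisy, the `hostmetrics` process scraper exposes mute flags for exactly these races; a sketch, assuming the collector-contrib version in use (0.91.0 here) supports them:

```yaml
# Sketch: silence expected per-process read errors in the process scraper
receivers:
  hostmetrics:
    scrapers:
      process:
        mute_process_name_error: true
        mute_process_exe_error: true
        mute_process_io_error: true
```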

## Maintainers

* [juanjjaramillo](https://github.com/juanjjaramillo)
* [csongnr](https://github.com/csongnr)
* [dbudziwojskiNR](https://github.com/dbudziwojskiNR)
104 changes: 104 additions & 0 deletions charts/nr-k8s-otel-collector/README.md.gotmpl
@@ -0,0 +1,104 @@
[![Community Plus header](https://github.com/newrelic/opensource-website/raw/master/src/images/categories/Community_Plus.png)](https://opensource.newrelic.com/oss-category/#community-plus)


{{ template "chart.header" . }}
{{ template "chart.deprecationWarning" . }}

{{ template "chart.description" . }}

# Helm installation

Download and update the config [here](https://github.com/newrelic/helm-charts/tree/master/charts/nr-k8s-otel-collector/values.yaml#L20-L24) to add a cluster name and your New Relic Ingest - License key.

Example:
```
licenseKey: "EXAMPLEINGESTLICENSEKEY345878592NRALL"
cluster: "SampleApp"
```

You can install this chart directly from this Helm repository:

```shell
helm repo add newrelic https://helm-charts.newrelic.com
helm upgrade nr-k8s-otel-collector newrelic/nr-k8s-otel-collector -f your-custom-values.yaml -n newrelic --create-namespace --install
```

{{ template "chart.sourcesSection" . }}

## Confirm installation
### Watch pods spin up:

```
kubectl get pods -n newrelic --watch
```

### Check logs of opentelemetry pod that spins up:
```
kubectl logs <otel-pod-name> -n newrelic
```

### Confirm data coming through in New Relic
You should see data reporting into New Relic within a couple of seconds in the `InfrastructureEvent`, `Metric`, and `Log` tables.
```
FROM Metric SELECT * WHERE k8s.cluster.name='<CLUSTER_NAME>'
```
```
FROM InfrastructureEvent SELECT * WHERE k8s.cluster.name='<CLUSTER_NAME>'
```
```
FROM Log SELECT * WHERE k8s.cluster.name='<CLUSTER_NAME>'
```
## Uninstall

Run the following command.

```
helm uninstall nr-k8s-otel-collector -n newrelic
```

## Values managed globally

This chart implements [New Relic's common Helm library](https://github.com/newrelic/helm-charts/tree/master/library/common-library), which
means that it honors a wide range of defaults and globals common to most New Relic Helm charts.

Options that can be defined globally include `affinity`, `nodeSelector`, `tolerations` and others. The full list can be found in the
[user's guide of the common library](https://github.com/newrelic/helm-charts/blob/master/library/common-library/README.md).

{{ template "chart.valuesSection" . }}

## Common Errors

### Exporting Errors

Timeout errors while starting up the collector are expected as the collector attempts to establish a connection with NR.
These timeout errors can also pop up over time while the collector is running, but they are transient and expected to self-resolve. Further improvements are underway to reduce the number of timeout errors we're seeing from the NR1 endpoint.

```
info exporterhelper/retry_sender.go:154 Exporting failed. Will retry the request after interval. {"kind": "exporter", "data_type": "metrics", "name": "otlphttp/newrelic", "error": "failed to make an HTTP request: Post \"https://staging-otlp.nr-data.net/v1/metrics\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)", "interval": "5.445779213s"}
```

### No such file or directory

Sometimes we see "failed to open file" errors from the `filelog` and `hostmetrics` receivers because of a race condition: the pod or process was ephemeral (e.g. a cronjob or a short-lived `sleep`) and was terminated before the collector could read the file, so the file or directory no longer exists.

`filelog` error:
```
Failed to open file {"kind": "receiver", "name": "filelog", "data_type": "logs", "component": "fileconsumer", "error": "open /var/log/pods/<podname>/<containername>/0.log: no such file or directory"}
```
`hostmetrics` error:
```
Error scraping metrics {"kind": "receiver", "name": "hostmetrics", "data_type": "metrics", "error": "error reading <metric> for process \"<process>\" (pid <PID>): open /hostfs/proc/<PID>/stat: no such file or directory; error reading <metric> info for process \"<process>\" (pid <PID>): open /hostfs/proc/<PID>/<metric>: no such file or directory", "scraper": "process"}
```

{{ if .Maintainers }}
## Maintainers
{{ range .Maintainers }}
{{- if .Name }}
{{- if .Url }}
* [{{ .Name }}]({{ .Url }})
{{- else }}
* {{ .Name }}
{{- end }}
{{- end }}
{{- end }}
{{- end }}
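The `chart.header`, `chart.valuesSection`, and related directives above are helm-docs template conventions, so this file is presumably rendered into README.md by [helm-docs](https://github.com/norwoodj/helm-docs); a typical invocation under that assumption:

```shell
# Regenerate README.md from README.md.gotmpl (assumes helm-docs is installed)
helm-docs --chart-search-root=charts/nr-k8s-otel-collector
```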
15 changes: 13 additions & 2 deletions charts/nr-k8s-otel-collector/values.yaml
@@ -8,12 +8,15 @@ kube-state-metrics:
# metrics by default. You can enable the target labels/annotations metrics to be monitored by using the metricLabelsAllowlist/metricAnnotationsAllowList options described [here](https://github.com/prometheus-community/helm-charts/blob/159cd8e4fb89b8b107dcc100287504bb91bf30e0/charts/kube-state-metrics/values.yaml#L274) in
# your Kubernetes clusters.
enabled: true
- prometheusScrape: false # Disable prometheus from auto-discovering KSM and potentially scraping duplicate data.
+ # -- Disable prometheus from auto-discovering KSM and potentially scraping duplicated data
+ prometheusScrape: false

image:
# -- OTel collector image to be deployed. You can use your own collector as long as it meets the requirements mentioned below.
repository: otel/opentelemetry-collector-contrib
# -- The pull policy is defaulted to IfNotPresent, which skips pulling an image if it already exists. If pullPolicy is defined without a specific value, it is also set to Always.
pullPolicy: IfNotPresent
- # Overrides the image tag whose default is the chart appVersion.
+ # -- Overrides the image tag whose default is the chart appVersion.
tag: "0.91.0"

# -- Name of the Kubernetes cluster monitored. Mandatory. Can be configured also with `global.cluster`
@@ -25,11 +28,14 @@ customSecretName: ""
# -- In case you don't want to have the license key in your values, this allows you to point to the key inside the secret where the license key is located. Can be configured also with `global.customSecretLicenseKey`
customSecretLicenseKey: ""

# -- Annotations to be added to each pod created by the chart
podAnnotations: {}

# -- Sets security context (at pod level). Can be configured also with `global.podSecurityContext`
podSecurityContext: {}
# fsGroup: 2000

# -- Sets security context (at container level). Can be configured also with `global.podSecurityContext`
securityContext:
privileged: true
# capabilities:
@@ -39,6 +45,7 @@ securityContext:
# runAsNonRoot: true
# runAsUser: 1000

# -- The default set of resources assigned to the pods is shown below:
resources: {}
# We usually recommend not to specify default resources and to leave this as a conscious
# choice for the user. This also increases chances charts run on environments with little
@@ -51,15 +58,19 @@ resources: {}
# cpu: 100m
# memory: 128Mi

# -- Sets pod's node selector. Can be configured also with `global.nodeSelector`
nodeSelector: {}

# -- Sets pod's tolerations to node taints. Can be configured also with `global.tolerations`
tolerations: []

# -- Sets pod/node affinities
affinity: {}

# -- (bool) Sets the debug logs to this integration or all integrations if it is set globally. Can be configured also with `global.verboseLog`
# @default -- `false`
verboseLog:

# -- (bool) Send the metrics to the staging backend. Requires a valid staging license key. Can be configured also with `global.nrStaging`
# @default -- `false`
nrStaging:
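Note the pattern on these final two values: each is left unset so a value configured under `global` can take effect, while the `# @default` annotation documents the effective default (`false`) for helm-docs. Opting in locally is a one-line override; a sketch:

```yaml
# Sketch: enable debug logging for this chart only
verboseLog: true
```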