import optimizingicon from './images/oma-oe-dg-optimizing-icon.png'
## Desired outcome [#desired-outcome]
Maximize the observability value of your data by optimizing data ingest. Reduce non-essential ingest data so you can stay within your budget.

## Process [#process]

1. [Prioritize your observability objectives](#prioritize-objectives)
2. [Develop an optimization plan](#develop-plan)
3. [Use data reduction techniques to execute your plan](#use-reduction-techniques)


### Prioritize your observability objectives [#prioritize-objectives]

One of the most important parts of the data governance framework is to align collected telemetry with *observability value drivers*. You need to ensure that you understand what the primary observability objective is when you configure new telemetry.


When you introduce new telemetry, you want to understand what it delivers to your overall observability solution. Your new data might overlap with other data. If you're considering telemetry that you can't align to any of the key objectives, you may want to reconsider introducing that data.

Objectives include:
Alignment to these objectives is what allows you to make flexible and intuitive decisions.

In this section, you'll work from two core assumptions:


- You have the tools and techniques from the [Baseline your data ingest](/docs/new-relic-solutions/observability-maturity/operational-efficiency/dg-baselining) section to properly account for where your ingest comes from.
- You have a good understanding of the [observability maturity value drivers](https://docs.newrelic.com/docs/new-relic-solutions/observability-maturity/introduction/). This will be crucial in applying a value and a priority to groups of telemetry.


Use the following examples to help you visualize how you would assess your own telemetry ingest and make the sometimes hard decisions needed to stay within budget. Although each of these examples tries to focus on a single value driver, most instrumentation serves more than one. This is the hardest part of data ingest governance.

<CollapserGroup>
<Collapser
title="Example 1: Focus on uptime and reliability"
>


An account is ingesting about 20% more than budgeted. A manager has asked the team to find some way to reduce consumption. Their most important value driver is `Uptime, performance, and reliability`.


<ImageSizing width="500px" height="500px" verticalAlign="middle">
![Value Drivers Uptime](images/oma-oe-dg-value-driver-uptime.png)
</ImageSizing>
Their estate includes:
- K8s monitoring (dev, staging, prod)
- Logs (dev, staging, prod - including debug)


<Callout variant='IMPORTANT' title='Optimization plan'>
- Omit debug logs (knowing they can be turned on if there is an issue) (saves 5%)
- Omit several K8s state metrics which are not required to display the Kubernetes cluster explorer (saves 10%)
</Callout>
The different approaches:

The volume of data generated by the APM agent will be determined by several factors:


- The amount of organic traffic generated by the application (for example, all things being equal, an application called one million times per day will generate more data than one called one thousand times per day)
- Some of the characteristics of the underlying transaction data itself (length and complexity of URLs)
- Whether the application is reporting database queries
- Whether the application has transactions with many (or any) custom attributes
- The error volume for the application
- Whether the application agent is configured for distributed tracing
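
Several of these factors map directly to agent-side settings. A rough sketch in Java agent `newrelic.yml` syntax (a sketch only; the values are illustrative, not recommendations):

```
common: &default_settings
  # Turning this off removes distributed trace span data entirely
  distributed_tracing:
    enabled: true
  # Caps on sampled events sent each harvest cycle
  transaction_events:
    max_samples_stored: 2000
  span_events:
    max_samples_stored: 1000
```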


### Managing volume

While you can assume that all calls to an application are needed to support the business, it's possible to be more thrifty in your overall architecture. In an extreme case, you may have a user profile microservice that is called every 10 seconds by its clients, which helps reduce latency if user information is updated by other clients. One lever you have is reducing the frequency of calls to this service to, for example, once every minute.

### Custom attributes


Any [custom attributes](/docs/data-apis/custom-data/custom-events/collect-custom-attributes/) added using a call to the APM API [addCustomParameter](https://developer.newrelic.com/collect-data/custom-attributes/) will add an additional attribute to the transaction payload. These are often useful, but as application logic and priorities change, the data can become less valuable or even obsolete.

The Java agent captures the following request.headers by default:
Developers might also use `addCustomParameter` to capture additional (potentially sensitive) data.

For an example of the rich attribute configuration available in APM, see our [Java agent documentation](/docs/apm/agents/java-agent/attributes/java-agent-attributes/#requestparams).
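
For instance, a `newrelic.yml` fragment along these lines (a sketch; the excluded attribute names are illustrative) stops obsolete or sensitive attributes from being sent at all:

```
common: &default_settings
  attributes:
    enabled: true
    exclude:
      # Stop capturing request headers entirely
      - request.headers.*
      # Hypothetical custom attribute that is no longer useful
      - customer.legacy_segment
```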


### Error events

You can determine how APM handles errors, which can reduce data volume in some cases. For example, there may be a high-volume but harmless error that can't be removed from the application at present.
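
One way to handle this is with the Java agent's `error_collector` settings. A minimal sketch (the class names are hypothetical):

```
common: &default_settings
  error_collector:
    enabled: true
    # Ignored errors are not reported at all, which reduces error event volume
    ignore_classes:
      - "com.example.HarmlessTimeoutException"
    # Expected errors are still recorded but don't affect error rate or Apdex
    expected_classes:
      - "com.example.KnownVendorException"
```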
1. Under *Ajax request deny list*, add the filters you would like to apply to your application.
2. Select *Save application settings* to update the agent configuration.
3. Redeploy the browser agent (either restarting the associated APM agent or updating the copy/paste browser installation).


### Validating

```
FROM AjaxRequest SELECT * WHERE requestUrl LIKE '%example.com%'
```
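
To gauge how much `AjaxRequest` volume remains after the deny list takes effect, you can also estimate ingest directly. A rough sketch, assuming the `bytecountestimate()` NRQL function:

```
FROM AjaxRequest SELECT bytecountestimate()/10e8 AS 'Estimated GB ingested' SINCE 1 day ago
```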

### Android


[Feature Flags](/docs/mobile-monitoring/new-relic-mobile-android/android-sdk-api/android-agent-configuration-feature-flags/)


See this document for [more details](/docs/mobile-monitoring/new-relic-mobile-io
- Log forwarding configuration
</Callout>


New Relic's [Infrastructure agent configuration file](/docs/infrastructure/install-infrastructure-agent/configuration/infrastructure-agent-configuration-settings/) contains a couple of powerful ways to control ingest volume. The most important is sampling rates: there are several distinct sampling rate settings you can configure. The other is custom process sample filters.


### Sampling rates

There are a number of sampling rates that can be configured in infrastructure, but these are the most commonly used.
|Sampler|Default (seconds)|Disable with|
|---|---|---|
|`metrics_system_sample_rate`|5|-1|
|`metrics_nfs_sample_rate`|5|-1|
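
A sketch of how those settings look in `newrelic-infra.yml` (intervals are in seconds, the values shown are illustrative, and -1 disables a sampler entirely):

```
metrics_system_sample_rate: 30
metrics_storage_sample_rate: 30
metrics_network_sample_rate: 60
# Disable samplers you don't need
metrics_nfs_sample_rate: -1
metrics_process_sample_rate: -1
```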


### Process samples

Process samples can be the single highest-volume source of data from the infrastructure agent, because the agent sends information about every running process on a host. They are disabled by default, but can be enabled as follows:
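
A sketch of a `newrelic-infra.yml` fragment that turns process samples on and then restricts which processes report (the matcher block and process names are illustrative):

```
# Enable process samples (off by default)
enable_process_metrics: true
# Only send samples for the processes you actually care about
include_matching_metrics:
  process.name:
    - regex "^java"
    - regex "^nginx"
```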
Default network interface filters for Windows:

- Network interfaces that start with `Loop`, `isatap`, or `Local`


To override the defaults, include your own filter in the config file:
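
A sketch of what that override can look like, assuming the agent's `network_interface_filters` setting (the prefixes shown are illustrative):

```
network_interface_filters:
  # Drop metrics for interfaces whose names start with these prefixes
  prefix:
    - Loop
    - isatap
```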
- Controller manager data*
- ETCD data*
- Scheduler data*

*Not collected in a managed Kubernetes environment (EKS, GKE, AKS, etc.)
**Used in the default alert: "ReplicaSet doesn't have desired amount of pods"
New Relic's on-host integrations (OHI for short) represent a diverse set of integrations for third-party services such as PostgreSQL, Kafka, and Elasticsearch.

We'll use a few examples to demonstrate.


### [PostgreSQL integration](/docs/infrastructure/host-integrations/host-integrations-list/postgresql-monitoring-integration/#example-postgresSQL-collection-config)


<Callout variant='IMPORTANT' title='Growth drivers'>
- Number of tables monitored
- Number of indices monitored
</Callout>
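
A sketch of a `postgresql-config.yml` that keeps those growth drivers in check (env var names follow the standard nri-postgresql settings; hostnames and credentials are placeholders):

```
integrations:
  - name: nri-postgresql
    env:
      HOSTNAME: postgres.example.com
      PORT: 5432
      USERNAME: monitor
      PASSWORD: example_password
      DATABASE: orders
      # Collect only the schemas and tables you need instead of 'ALL'
      COLLECTION_LIST: '{"orders":{"public":{"customer":[],"lineitem":[]}}}'
      # Skip optional metric groups you don't use
      COLLECT_DB_LOCK_METRICS: false
      COLLECT_BLOAT_METRICS: false
    inventory_source: config/postgresql
```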


### [Kafka integration](/docs/infrastructure/host-integrations/host-integrations-list/kafka-monitoring-integration/)


<Callout variant='IMPORTANT' title='Growth drivers'>
- Number of brokers in cluster
- Number of topics in cluster
</Callout>
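
A sketch of a `kafka-config.yml` that limits topic-level collection (env var names follow the standard nri-kafka settings; broker hostnames are placeholders):

```
integrations:
  - name: nri-kafka
    env:
      CLUSTER_NAME: orders-cluster
      AUTODISCOVER_STRATEGY: bootstrap
      BOOTSTRAP_BROKER_HOST: kafka01.example.com
      BOOTSTRAP_BROKER_KAFKA_PORT: 9092
      BOOTSTRAP_BROKER_JMX_PORT: 9999
      # Collect only the topics you care about rather than all of them
      TOPIC_MODE: list
      TOPIC_LIST: '["orders", "payments"]'
      # Topic size collection is expensive on large clusters
      COLLECT_TOPIC_SIZE: false
    inventory_source: config/kafka
```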


### [Elasticsearch integration](/docs/infrastructure/host-integrations/host-integrations-list/elasticsearch-monitoring-integration)




<Callout variant='IMPORTANT' title='Growth drivers'>
- Number of nodes in cluster
- Number of indices in cluster
</Callout>
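
A sketch of an `elasticsearch-config.yml` that limits index-level collection (env var names follow the standard nri-elasticsearch settings; host and credentials are placeholders):

```
integrations:
  - name: nri-elasticsearch
    env:
      HOSTNAME: es01.example.com
      PORT: 9200
      USERNAME: monitor
      PASSWORD: example_password
      # Skip per-index and per-primary metrics, or scope them with a regex
      COLLECT_INDICES: false
      COLLECT_PRIMARIES: false
      INDICES_REGEX: '^logstash-.*'
      # Only collect cluster-level data from the elected master
      MASTER_ONLY: true
    inventory_source: config/elasticsearch
```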


### [JMX integration](/docs/infrastructure/host-integrations/host-integrations-list/jmx-monitoring-integration)

There are many other on-host integrations with configuration options that will help you optimize collection.

This is a good [starting point](/docs/infrastructure/infrastructure-integrations/get-started/introduction-infrastructure-integrations#on-host) to learn more.


</Collapser>
<Collapser
id="network-performance-monitoring"
title="Network performance monitoring"
>

<Callout variant='IMPORTANT' title='Growth drivers'>
- traps configured
</Callout>


This section focuses on New Relic's network performance monitoring, which relies on the `ktranslate` agent from Kentik. This agent is quite sophisticated, so it's important to fully understand the [advanced configuration docs](/docs/network-performance-monitoring/advanced/advanced-config) before undertaking major optimization efforts.



- `mibs_enabled`: Array of all active MIBs the ktranslate Docker image will poll. This list is automatically generated during discovery if the `discovery_add_mibs` attribute is true. MIBs not listed here are not polled on any device in the configuration file. You can specify an SNMP table directly in a MIB file using `MIB-NAME.tableName` syntax, for example `HOST-RESOURCES-MIB.hrProcessorTable`.
- `user_tags`: `key:value` pair attributes to give more context to the device. Tags at this level are applied to all devices in the configuration file.
- `devices`: Section listing devices to be monitored for flow.
Expand Down Expand Up @@ -936,8 +906,6 @@ Below are some detailed routing and filtering resources:
- [Fluentbit data pipeline](https://docs.fluentbit.io/manual/concepts/data-pipeline)
- [Forwarding logs with New Relic infrastructure agent](/docs/logs/forward-logs/forward-your-logs-using-infrastructure-agent/)
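
For example, if your optimization plan omits debug logs (as in Example 1 above), a Fluent Bit grep filter can drop them before they are forwarded. A minimal sketch; the `log` key and the level pattern depend on your log format:

```
[FILTER]
    Name     grep
    Match    *
    # Drop any record whose "log" field contains a debug level marker
    Exclude  log (?i)\bdebug\b
```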



</Collapser>
<Collapser
id="prometheus-metrics-sources"
title="Prometheus metrics sources"
>

Prometheus server scrape config options are [fully documented here](https://prom

### Option 2: [Prometheus OpenMetrics Integration (POMI)](/docs/integrations/prometheus-integrations/install-configure-openmetrics/install-update-or-uninstall-your-prometheus-openmetrics-integration)


POMI is a standalone integration that scrapes metrics from both dynamically discovered and static Prometheus endpoints. POMI then sends this data to NRDB via the New Relic Metric API. This integration is ideal for customers not currently running Prometheus Server.

#### POMI: scrape label
If you are running the New Relic Kubernetes Daemonset, it is important that you set `require_scrape_enabled_label_for_nodes: true` so that POMI does not collect duplicate metrics.
The endpoints targeted by the New Relic Kubernetes Daemonset are outlined [here](https://github.com/newrelic/nri-kubernetes/blob/main/README.md).


#### POMI: scrape label for nodes

POMI will discover any Prometheus endpoint exposed at the node-level by default. This typically includes metrics coming from Kubelet and cAdvisor.

#### POMI: Co-existing with *nri-kubernetes*


New Relic's [Kubernetes integration](/docs/integrations/kubernetes-integration/get-started/introduction-kubernetes-integration) collects a [number of metrics](/docs/integrations/kubernetes-integration/understand-use-data/find-use-your-kubernetes-data#metrics) out of the box. However, it does not collect every possible metric available from a Kubernetes cluster.


In the POMI config, you'll see a section similar to the following, which *disables* metric collection for a subset of metrics that the New Relic Kubernetes integration already collects from *Kube State Metrics*.

It's also very important to set `require_scrape_enabled_label_for_node: true` so that Kubelet and cAdvisor metrics are not duplicated.

*POMI config parameters*

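A sketch of what those parameters can look like, assuming nri-prometheus's `transformations` and `ignore_metrics` keys (the prefixes shown are illustrative; match them to what nri-kubernetes already collects):

```
transformations:
  - description: "Ignore kube-state-metrics already collected by nri-kubernetes"
    ignore_metrics:
      # Skip metrics whose names start with these prefixes
      - prefixes:
          - "kube_daemonset_"
          - "kube_deployment_"
          - "kube_pod_"
```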

#### POMI: request/limit settings

When running POMI, it's recommended to apply the following [resource limits](https://kubernetes.io/docs/tasks/configure-pod-container/assign-memory-resource/) for clusters generating approximately 500k DPM:

- CPU limit: 1 core (1000m)
- Memory limit: 1GB (1024Mi)
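
In a Kubernetes manifest, that translates to roughly the following container spec fragment (only the limits come from the list above; the request values are placeholders):

```
resources:
  limits:
    cpu: 1000m
    memory: 1Gi
  requests:
    cpu: 500m
    memory: 512Mi
```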
Trial and paid accounts receive a 1M DPM and 1M cardinality limit for trial purposes.

If you’re already running Prometheus Server, you can run DPM and cardinality estimates there prior to enabling POMI or remote_write.


*Data points per minute (DPM)*

`rate(prometheus_tsdb_head_samples_appended_total[10m]) * 60`

*Cardinality*

`topk(20, count by (__name__, job)({__name__=~".+"}))`



</Collapser>
<Collapser
id="cloud-integration"
title="Cloud integrations"
>


</Collapser>
<Collapser
id="cloud-metrics"
title="Cloud metrics"
>



</Collapser>
<Collapser
id="drop-specific-attributes"
title="Drop Specific Attributes"
title="Drop specific attributes"
>
One powerful thing about drop rules is that we can configure a rule that drops specific attributes but maintains the rest of the data intact. Use this to remove private data from NRDB, or to drop excessively large attributes. For example, stack traces or large chunks of JSON in log records can sometimes be very large.

To set these drop rules, change the *action* field to `DROP_ATTRIBUTES` instead of `DROP_DATA`.
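
A sketch of what such a mutation can look like, assuming the `nrqlDropRulesCreate` NerdGraph API (the account ID and attribute names are placeholders):

```
mutation {
  nrqlDropRulesCreate(
    accountId: 1234567,
    rules: [
      {
        action: DROP_ATTRIBUTES
        # Keep the log events but strip two large, illustrative attributes
        nrql: "SELECT stack_trace, large_json_payload FROM Log"
        description: "Drop large attributes from log events"
      }
    ]
  ) {
    successes {
      id
    }
    failures {
      error {
        reason
        description
      }
    }
  }
}
```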

</Collapser>
<Collapser
id="drop-random-sample-of-events"
title="Drop Random Sample of Events"
title="Drop random sample of events"
>

<Callout variant="caution">
Use this approach carefully, and only in situations where there are no other alternatives, since it can alter statistical inferences made from your data. However, for events with a massive sample size, you may be fine with only a portion of your data as long as you understand the consequences.
</Callout>
In this example we'll take advantage of the relative distribution of certain trace IDs to approximate random sampling. We can use the `rlike` operator to check for the leading values of a span's `trace.id` attribute.

The following example could drop about 25% of spans.

Useful expressions include:
* `^[0-2].*` approximates 18.75%
* `^[0-3].*` approximates 25.0%


See an example of a full mutation:
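
A sketch of such a mutation, assuming the `nrqlDropRulesCreate` NerdGraph API and NRQL's `RLIKE` operator (the account ID is a placeholder):

```
mutation {
  nrqlDropRulesCreate(
    accountId: 1234567,
    rules: [
      {
        action: DROP_DATA
        # Drop spans whose trace.id begins with 0-3 (roughly 25% of hex trace IDs)
        nrql: "SELECT * FROM Span WHERE trace.id RLIKE r'^[0-3].*'"
        description: "Drop approximately 25% of spans"
      }
    ]
  ) {
    successes {
      id
    }
    failures {
      error {
        reason
        description
      }
    }
  }
}
```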

The preceding examples should show you all you need to know to use these techniques.
</Collapser>
</CollapserGroup>
</Collapser>
</CollapserGroup>


</CollapserGroup>
