data-optimize-techniques.mdx

title: Optimize your ingest data
metaDescription: Taking your ingested and reported ingest data and optimizing it.
redirects:
  - /docs/new-relic-solutions/observability-maturity/operational-efficiency/data-governance-optimize-ingest-guide/
  - /docs/new-relic-solutions/observability-maturity/operational-efficiency/data-governance-forecast-ingest-guide/
freshnessValidatedDate: never

In the previous step, you created and refined your data optimization plan by checking your baseline report against your organization's objectives. Once you've lined up your data and measured it against your value drivers, you can start to optimize, and potentially reduce, your ingest data. There are two main ways to do this:

  • Optimize for data efficiency
  • Optimize using drop rules

We'll cover both methods below, as well as all the possible configurations that each option provides.

Optimize for data efficiency [#optimize-efficiency]

This section includes various ways to configure New Relic features to optimize data reporting and ingest:

  • Monitored transactions
  • Error activity
  • Custom events

The volume of data generated by the APM agent will be determined by several factors:

  • The amount of organic traffic generated by the application (for example, all things being equal, an application called one million times per day generates more data than one called one thousand times per day)
  • Some of the characteristics of the underlying transaction data itself (length and complexity of URLs)
  • Whether the application is reporting database queries
  • Whether the application has transactions with many (or any) custom attributes
  • The error volume for the application
  • Whether the application agent is configured for distributed tracing

Managing volume

While you can assume that all calls to an application are necessary, it's possible to make your overall architecture more efficient. You may have a user profile microservice called every 10 seconds by its clients. This helps reduce latency if some user information is updated by other clients. However, you have the option to reduce the frequency of calls to this service to every minute, for example.

Custom attributes

Any custom attribute added through the APM API call addCustomParameter adds an attribute to the transaction payload. Custom attributes are often useful, but as things change, the data can become less valuable or even obsolete.

The Java agent captures the following request.headers by default:

  • request.headers.referer
  • request.headers.accept
  • request.headers.contentLength
  • request.headers.host
  • request.headers.userAgent

Developers might also use addCustomParameter to capture more information using more verbose headers.

For an example of the rich configuration that's available in relation to APM, see our Java agent documentation.

Error events

It's possible to reduce the volume of data by deciding how APM handles errors. For example, there may be a harmless but high-volume error that you can't remove at the present time.

To do this, you can collect errors, ignore them, or mark them as expected. For more information, see Manage APM errors.

Database queries

One highly variable aspect of APM instances is the number of database calls and set configurations. To help with this, you can control how verbose database query monitoring is. These queries will show up in the Transaction traces page.

Common database query setting changes include:

For more details, see Transaction traces database queries page.

Setting event limits

Our APM and mobile agents have limits on how many events they can report per harvest cycle. If there were no limit, a large enough number of sent events could impact the performance of your application or of New Relic. Upon reaching the limit, the agents begin sampling events to give a representation of events across the harvest cycle. Different agents have different limits.

Events that have limits and are subject to sampling include:

  • Custom events reported via agent API (for example, the .NET agent's RecordCustomEvent)
  • Mobile
  • MobileCrash
  • MobileHandledException
  • MobileRequest
  • Span (see distributed tracing sampling)
  • Transaction
  • TransactionError

Most agents have configuration options for changing the event limit on sampled transactions. For example, the Java agent uses max_samples_stored. The default value for max_samples_stored is 2000 and the max is 10000. This value governs how many sampled events can report every 60 seconds from an agent instance. For a full explanation of event sampling limits, see Event limits.

You can compensate for sampled events via the NRQL EXTRAPOLATE operator.

Before attempting to change how sampling occurs, keep the following in mind:

  • The more events you report, the more memory your agent will use.
  • You can usually get the data you need without raising an agent's event-reporting limit.
  • The payload size limit is 1MB (10^6 bytes) (compressed), so the number of events may still be affected by that limit. To find if events are being dropped, see the agent log for a 413 HTTP status message.
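
To make the idea of sampling at a limit concrete, here is a minimal sketch of reservoir sampling, one common way to keep a representative fixed-size subset of a stream. It is an illustration only: the function name `reservoir_sample` is hypothetical, and this is not presented as the algorithm any New Relic agent actually uses.

```python
import random

def reservoir_sample(events, limit, rng=None):
    """Keep at most `limit` events, each seen event having an equal chance to stay."""
    rng = rng or random.Random()
    reservoir = []
    for seen, event in enumerate(events, start=1):
        if len(reservoir) < limit:
            # Below the limit, every event is kept.
            reservoir.append(event)
        else:
            # Past the limit, the new event replaces a stored one
            # with probability limit / seen.
            slot = rng.randrange(seen)
            if slot < limit:
                reservoir[slot] = event
    return reservoir

# 10,000 events in one harvest cycle, with a limit of 2,000 stored samples.
kept = reservoir_sample(range(10_000), limit=2000, rng=random.Random(0))
print(len(kept))
```

When fewer events arrive than the limit allows, nothing is sampled away and every event is returned.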

Log sampling rate

Newer versions of the New Relic APM language agents can forward logs directly to New Relic. You may want to limit how large logging spikes can be from each APM agent instance.

For details on APM agent log sampling, see Log forwarders.

Transaction traces

Growth drivers

  • Number of connected services
  • Number of monitored method calls per connected services

In APM, transaction traces record in-depth details about your application's transactions and database calls. You can edit the default settings for transaction traces.

This is also highly configurable via transaction trace configuration. The level and mode of configurability will be language-specific.

Transaction trace settings available using server-side configuration will differ depending on the New Relic agent you use. The UI includes descriptions of each. Settings in the UI may include:

  • Transaction tracing and threshold
  • Record SQL, including recording level and input fields
  • Log SQL and stack trace threshold
  • SQL query plans and threshold
  • Error collection, including HTTP code and error class
  • Slow query tracing
  • Thread profiler

Distributed tracing

Distributed tracing configuration has some language-specific differences. You can disable distributed tracing as needed. This is an example for Java agent newrelic.yml:

distributed_tracing:
    enabled: false

This is a Node.js example for newrelic.js:

distributed_tracing: {
  enabled: false
}

Data volume also varies based on whether you are using Infinite Tracing. Standard distributed tracing for APM agents (above) captures up to 10% of your traces, but if you want to analyze all your data and find the most relevant traces, you can set up Infinite Tracing. This alternative to standard distributed tracing is available for all APM language agents. The main parameters that could drive a small increase in monthly ingest are:

  • Configure trace observer monitoring

  • Configure span attribute trace filter

  • Configure random trace filter

Browser agent

Growth drivers

  • Page loads
  • Ajax calls
  • Error activity

For browser agent version 1211 or higher, all network requests made by a page are recorded as AjaxRequest events. You can use the deny list configuration options in the application settings UI page to filter which requests record events. Regardless of this filter, all network requests are captured as metrics and available in the AJAX page.

Using the deny list

You can block requests in three ways:

  • To block recording of all AjaxRequest events, add an asterisk * as a wildcard.
  • To block recording of AjaxRequest events to a domain, enter just the domain name. Example: example.com
  • To block recording of AjaxRequest events to a specific domain and path, enter the domain and path. Example: example.com/path
  • The protocol, port, search and hash of a URL are ignored by the deny list.
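
The matching rules above can be sketched in a few lines of Python. This is a hedged illustration, not the browser agent's implementation: the function name `is_denied` is hypothetical, and exact matching of host and path (rather than subdomain or prefix matching) is an assumption.

```python
from urllib.parse import urlsplit

def is_denied(request_url, deny_list):
    """Return True if a request URL matches any deny list entry.

    Only host and path are compared; protocol, port, search (query string),
    and hash are ignored, as the rules above describe.
    """
    parts = urlsplit(request_url)
    host = parts.hostname or ""
    path = parts.path.rstrip("/")
    for entry in deny_list:
        if entry == "*":
            return True  # wildcard blocks every request
        entry_host, _, entry_path = entry.partition("/")
        if host != entry_host.lower():
            continue
        if not entry_path:
            return True  # domain-only entry blocks the whole domain
        if path == "/" + entry_path.rstrip("/"):
            return True  # domain-and-path entry (assumed exact match)
    return False

print(is_denied("https://example.com:8443/path?x=1#frag", ["example.com/path"]))
print(is_denied("https://example.com/other", ["example.com/path"]))
```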

To validate whether the filters you have added work as expected, run a NRQL query for AjaxRequest events matching your filter.

Accessing the deny list

To update the deny list of URLs your application will filter from creating events, go to the app settings UI page:

  1. Go to one.newrelic.com, and click Browser.
  2. Select an app.
  3. On the left navigation, click App settings.
  4. Under Ajax request deny list, add the filters you'd like to apply.
  5. Select Save application settings to update the agent configuration.
  6. Redeploy the browser agent by either restarting the associated APM agent or updating the copy/paste browser installation.

Validating

FROM AjaxRequest SELECT * WHERE requestUrl LIKE '%example.com%'

Mobile agent

Growth drivers

  • Monthly active users
  • Crash events
  • Number of events per user

Android

All settings, including the call to invoke the agent, are called in the onCreate method of the MainActivity class. To change settings, call the setting in one of two ways (if the setting supports it):

NewRelic.disableFeature(FeatureFlag.DefaultInteractions);
NewRelic.enableFeature(FeatureFlag.CrashReporting);
NewRelic.withApplicationToken(NEW_RELIC_TOKEN).start(this.getApplication());

Analytics settings enable or disable the collection of event data. These events are reported to and used in the Crash analysis page.

It's also possible to configure agent logging to be more or less verbose.

iOS

As with Android, New Relic's iOS configuration allows you to enable and disable feature flags.

You can configure the following feature flags:

Crash and error reporting

  • NRFeatureFlag_CrashReporting
  • NRFeatureFlag_HandleExceptionEvents

Distributed tracing

  • NRFeatureFlag_DistributedTracing

Interactions

  • NRFeatureFlag_DefaultInteractions
  • NRFeatureFlag_InteractionTracing
  • NRFeatureFlag_SwiftInteractionTracing

Network feature flags

  • NRFeatureFlag_ExperimentalNetworkInstrumentation
  • NRFeatureFlag_NSURLSessionInstrumentation
  • NRFeatureFlag_NetworkRequestEvents
  • NRFeatureFlag_RequestErrorEvents
  • NRFeatureFlag_HttpResponseBodyCapture

For more details, see Feature flags.

Infrastructure agent

Growth drivers

  • Hosts and containers monitored
  • Sampling rates for core events
  • Process sample configurations
  • Custom attributes
  • Number and type of on-host integrations installed
  • Log forwarding configuration

The infrastructure agent configuration file contains two ways to control ingest volume. The most important ingest control is configuring sampling rates. There are several distinct sampling rate configurations that you can adjust. It's also possible to create regular expressions to control what gets collected from certain collectors, such as ProcessSample and NetworkSample.

Configurable sampling rates

There are a number of sampling rates that you can configure in infrastructure, but these are the most commonly used.

| Parameter | Default | Disable |
| --- | --- | --- |
| metrics_storage_sample_rate | 5 | -1 |
| metrics_process_sample_rate | 20 | -1 |
| metrics_network_sample_rate | 10 | -1 |
| metrics_system_sample_rate | 5 | -1 |
| metrics_nfs_sample_rate | 5 | -1 |

Process samples

Process samples are often the highest-volume source of data from the infrastructure agent because they send information about every running process on a host. They're disabled by default, but you can enable them as follows:

enable_process_metrics: true

Setting enable_process_metrics to false has the same effect as setting metrics_process_sample_rate to -1. By default, processes using low memory are excluded from sampling. For more information, see disable-zero-mem-process-filter.

You can control how much data you send by configuring include_matching_metrics, which allows you to restrict the transmission of metric data based on the values of metric attributes. You include metric data by defining literal or partial values for any of the attributes of the metric. For example, you can choose to send the host.process.cpuPercent of all processes whose process.name matches the ^java regular expression.

In this example, we include process metrics using executable files and names:

  include_matching_metrics:             # You can combine attributes from different metrics
    process.name:
      - regex "^java"                   # Include all processes starting with "java"
    process.executable:
      - "/usr/bin/python2"              # Include the Python 2.x executable
      - regex "\\System32\\svchost"     # Include all svchost executables

You can also use this filter for the Kubernetes integration:

  env:
    - name: NRIA_INCLUDE_MATCHING_METRICS
      value: |
        process.name:
          - regex "^java"
        process.executable:
          - "/usr/bin/python2"
          - regex "\\System32\\svchost"
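
To reason about which processes a rule set like the one above would keep, a small simulation can help. The sketch below is only an approximation under stated assumptions: plain entries are treated as exact values, `regex` entries as unanchored regular expressions, and a sample is kept when any rule matches. The names `RULES`, `rule_matches`, and `is_included` are hypothetical, and the agent's real evaluation may differ.

```python
import re

# Hypothetical rule set mirroring the YAML above.
RULES = {
    "process.name": ['regex "^java"'],
    "process.executable": ["/usr/bin/python2", r'regex "\\System32\\svchost"'],
}

def rule_matches(rule, value):
    """Entries prefixed with 'regex ' are patterns; anything else is a literal."""
    if rule.startswith("regex "):
        pattern = rule[len("regex "):].strip().strip('"')
        return re.search(pattern, value) is not None
    return rule == value

def is_included(sample, rules=RULES):
    """Assumption: a sample is kept when any rule matches its attribute."""
    return any(
        rule_matches(rule, sample.get(attribute, ""))
        for attribute, attribute_rules in rules.items()
        for rule in attribute_rules
    )

print(is_included({"process.name": "java", "process.executable": "/usr/bin/java"}))
print(is_included({"process.name": "nginx", "process.executable": "/usr/sbin/nginx"}))
```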

Network interface filter

Growth drivers

  • Number of network interfaces monitored

The configuration uses a simple pattern-matching mechanism that can look for interfaces that start with a specific sequence of letters or numbers following either pattern:

  • {name}[other characters]
  • [number]{name}[other characters], where you specify the name using the index-1 option

network_interface_filters:
  prefix:
    - dummy
    - lo
  index-1:
    - tun

Default network interface filters for Linux:

  • Network interfaces that start with dummy, lo, vmnet, sit, tun, tap, or veth
  • Network interfaces that contain tun or tap

Default network interface filters for Windows:

  • Network interfaces that start with Loop, isatap, or Local

To override the defaults, include your own filter in the config file:

network_interface_filters:
  prefix:
    - dummy
    - lo
  index-1:
    - tun

Custom attributes

Custom attributes are key-value pairs similar to tags in other tools used to annotate the data from the infrastructure agent. You can use this metadata to build filter sets, group your results, and annotate your data. For example, you might indicate a machine's environment (staging or production), the service a machine hosts (login service, for example), or the team responsible for that machine.

Example of custom attributes from newrelic.yml

custom_attributes:
  environment: production
  service: billing
  team: alpha-team

If the data isn't well organized or has become obsolete in any way, you should consider streamlining these attributes.

Kubernetes integration

Growth drivers

  • Number of pods and containers monitored
  • Frequency and number of kube state metrics collected
  • Logs generated per cluster

Complex and decentralized systems like Kubernetes have the potential to generate a lot of telemetry in a short amount of time. There are a few good approaches to managing data ingest in Kubernetes. These will be very straightforward if you are using observability as code in your K8s deployments.

We highly recommend you install this Kubernetes data ingest analysis dashboard before making any decisions about reducing ingest. To get this dashboard, see the Infrastructure integrations quickstart.

Scrape interval

Depending on your observability objectives, you may consider adjusting the scrape interval, which has a default time of 15 seconds. The Kubernetes cluster explorer only refreshes every 45s. If your primary use of the Kubernetes data is to support the KCE visualizations, you may consider changing your scrape interval to 20s. Changing from 15s to 20s can have a substantial impact.

For more details about managing this, see our Helm integration scrape interval docs.
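
As a rough worked example of why a 5-second change matters, the arithmetic below compares scrapes per hour at each interval. It assumes ingest scales linearly with the number of scrapes, which is a simplification; the function names are hypothetical.

```python
def samples_per_hour(interval_seconds):
    """Number of scrapes in one hour at a fixed interval."""
    return 3600 / interval_seconds

def relative_reduction(old_interval, new_interval):
    """Fraction of scrapes saved by moving from old_interval to new_interval."""
    return 1 - samples_per_hour(new_interval) / samples_per_hour(old_interval)

print(samples_per_hour(15))        # scrapes per hour at the default interval
print(samples_per_hour(20))        # scrapes per hour at 20s
print(relative_reduction(15, 20))  # fraction of scrapes saved
```

Under that assumption, moving from 15s to 20s removes a quarter of the scrapes.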

Filtering Namespaces

The Kubernetes integration version 3 and above allows filtering which namespaces are scraped by labelling them. By default, all namespaces are scraped.

We use the namespaceSelector in the same way Kubernetes does. To include only namespaces matching a label, change the namespaceSelector by adding the following to your values-newrelic.yaml, under the newrelic-infrastructure section:

common:
  config:
    namespaceSelector:
      matchLabels:
        key1 : "value1"

In this example only namespaces with the label newrelic.com/scrape set to true will be scraped:

global:
  licenseKey: _YOUR_NEW_RELIC_LICENSE_KEY_
  cluster: _K8S_CLUSTER_NAME_

# ... Other settings as shown above

# Configuration for newrelic-infrastructure
newrelic-infrastructure:
  # ... Other settings as shown above
  common:
    config:
      namespaceSelector:
        matchLabels:
          newrelic.com/scrape: "true"

You can also use Kubernetes match expressions to include or exclude namespaces. The valid operators are:

  • In
  • NotIn
  • Exists
  • DoesNotExist

The general structure of the matchExpressions section is one or more of the following lines:

{key: VALUE, operator: OPERATOR, values: LIST_OF_VALUES}

Here's a complete example:

common:
  config:
    namespaceSelector:
      matchExpressions:
      - {key: newrelic.com/scrape, operator: NotIn, values: ["false"]}

You can include more than one line in the `matchExpressions` section, and the expressions are concatenated. All must be true for the filter to apply. Labels and match expressions are explained in more detail [here](https://kubernetes.io/docs/concepts/overview/working-with-objects/labels/).

In this example, namespaces with the label newrelic.com/scrape set to false will be excluded:

global:
  licenseKey: _YOUR_NEW_RELIC_LICENSE_KEY_
  cluster: _K8S_CLUSTER_NAME_

# ... Other settings as shown above

# Configuration for newrelic-infrastructure
newrelic-infrastructure:
  # ... Other settings as shown above
  common:
    config:
      namespaceSelector:
        matchExpressions:
        - {key: newrelic.com/scrape, operator: NotIn, values: ["false"]}
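
To check how a selector treats a given namespace before deploying it, the Kubernetes label selector semantics can be sketched as follows. The function names are hypothetical, and the sketch models standard Kubernetes behavior (for example, NotIn also matches a namespace that lacks the key), not the integration's own code.

```python
def matches_expression(labels, expression):
    """Evaluate one matchExpressions entry against a namespace's labels."""
    key = expression["key"]
    operator = expression["operator"]
    values = expression.get("values", [])
    if operator == "In":
        return key in labels and labels[key] in values
    if operator == "NotIn":
        return key not in labels or labels[key] not in values
    if operator == "Exists":
        return key in labels
    if operator == "DoesNotExist":
        return key not in labels
    raise ValueError(f"unsupported operator: {operator}")

def namespace_selected(labels, selector):
    """All matchLabels and all matchExpressions must hold."""
    match_labels = selector.get("matchLabels", {})
    expressions = selector.get("matchExpressions", [])
    return all(labels.get(k) == v for k, v in match_labels.items()) and all(
        matches_expression(labels, e) for e in expressions
    )

selector = {"matchExpressions": [
    {"key": "newrelic.com/scrape", "operator": "NotIn", "values": ["false"]},
]}
print(namespace_selected({}, selector))                               # unlabeled namespace
print(namespace_selected({"newrelic.com/scrape": "false"}, selector))  # opted out
```

With the NotIn selector above, unlabeled namespaces are still scraped, and only namespaces explicitly labeled "false" are excluded.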

See a full list of the settings that you can modify in the chart's README file.

How can I know which namespaces are excluded? [#excluded-namespaces]

All the namespaces within the cluster are listed in the K8sNamespaceSample event. The nrFiltered attribute indicates whether the data related to a namespace is going to be scraped.

Use this query to know which namespaces are being monitored:

FROM K8sNamespaceSample SELECT displayName, nrFiltered
WHERE clusterName = 'INSERT_NAME_OF_CLUSTER' SINCE 2 MINUTES AGO

What data is being discarded from the excluded namespaces? [#namespaces-discarded-data]

The following samples won't be available for the excluded namespaces:

  • K8sContainerSample
  • K8sDaemonsetSample
  • K8sDeploymentSample
  • K8sEndpointSample
  • K8sHpaSample
  • K8sPodSample
  • K8sReplicasetSample
  • K8sServiceSample
  • K8sStatefulsetSample
  • K8sVolumeSample

Kubernetes state metrics

The Kubernetes cluster explorer requires only the following kube state metrics (KSM):

  • Container data
  • Cluster data
  • Node data
  • Pod data
  • Volume data
  • API server data¹
  • Controller manager data¹
  • ETCD data¹
  • Scheduler data¹

¹ Not collected in a managed Kubernetes environment (EKS, GKE, AKS, etc.)

You may consider disabling some of the following:

  • DaemonSet data
  • Deployment data
  • Endpoint data
  • Namespace data
  • ReplicaSet data²
  • Service data
  • StatefulSet data

² Used in the default alert: “ReplicaSet doesn't have desired amount of pods”

Example of updating state metrics in manifest (Deployment)

[spec]
  [template]
    [spec]
      [containers]
        [name=kube-state-metrics]
        [args]
        #- --collectors=daemonsets
        #- --collectors=deployments
        #- --collectors=endpoints
        #- --collectors=namespaces
        #- --collectors=replicasets
        #- --collectors=services
        #- --collectors=statefulsets

Example of updating state metrics in manifest (ClusterRole)

```yaml
[rules]
# - apiGroups: ["extensions", "apps"]
#   resources:
#   - daemonsets
#   verbs: ["list", "watch"]

# - apiGroups: ["extensions", "apps"]
#   resources:
#   - deployments
#   verbs: ["list", "watch"]

# - apiGroups: [""]
#   resources:
#   - endpoints
#   verbs: ["list", "watch"]

# - apiGroups: [""]
#   resources:
#   - namespaces
#   verbs: ["list", "watch"]

# - apiGroups: ["extensions", "apps"]
#   resources:
#   - replicasets
#   verbs: ["list", "watch"]

# - apiGroups: [""]
#   resources:
#   - services
#   verbs: ["list", "watch"]

# - apiGroups: ["apps"]
#   resources:
#   - statefulsets
#   verbs: ["list", "watch"]
```

Config lowDataMode in nri-bundle chart

Our Helm charts support the option to reduce the amount of data ingested at the cost of dropping detailed information. To enable it, set global.lowDataMode to true in the nri-bundle chart.

lowDataMode affects three specific components of the nri-bundle chart:

  1. The infrastructure agent interval increases from 15 to 30 seconds.
  2. The Prometheus OpenMetrics integration excludes a few metrics, as indicated in the Helm doc below.
  3. Label and annotation details are dropped from logs.

You can find more details about this configuration in our Helm doc.

On-host integrations

New Relic's on-host integrations represent a diverse set of integrations for third-party services such as PostgreSQL, MySQL, Kafka, and RabbitMQ. It's impossible to provide every optimization technique in the scope of this document, but these techniques generally apply:

  • Manage sampling rate
  • Manage those parts of the configuration that can increase or decrease breadth of collection
  • Manage those parts of the configuration that allow for custom queries
  • Manage the infrastructure agents' custom attributes to apply to all on-host integration data.

We'll use a few examples to demonstrate.

Growth drivers

  • Number of tables monitored
  • Number of indices monitored

The PostgreSQL on-host integration configuration provides these adjustable settings that can help manage data volume:

  • interval: Default is 15s
  • COLLECTION_LIST: List of tables to monitor (use ALL to monitor all tables)
  • COLLECT_DB_LOCK_METRICS: Collect dblock metrics
  • PGBOUNCER: Collect pgbouncer metrics
  • COLLECT_BLOAT_METRICS: Collect bloat metrics
  • METRICS: Set to true to collect only metrics
  • INVENTORY: Set to true to enable only inventory collection
  • CUSTOM_METRICS_CONFIG: Config file containing custom collection queries

Sample config:

integrations:
  - name: nri-postgresql
    env:
      USERNAME: postgres
      PASSWORD: pass
      HOSTNAME: psql-sample.localnet
      PORT: 6432
      DATABASE: postgres
      COLLECT_DB_LOCK_METRICS: false
      COLLECTION_LIST: '{"postgres":{"public":{"pg_table1":["pg_index1","pg_index2"],"pg_table2":[]}}}'
      TIMEOUT:  10
    interval: 15s
    labels:
      env: production
      role: postgresql
    inventory_source: config/postgresql
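
Hand-writing the COLLECTION_LIST JSON is error-prone, so it can help to generate it. The sketch below rebuilds the value from the sample config above; the nesting (database, then schema, then table, then a list of indexes) is read from that sample, and the variable names are hypothetical.

```python
import json

# Nesting as read from the sample: database -> schema -> table -> indexes.
collection = {
    "postgres": {
        "public": {
            "pg_table1": ["pg_index1", "pg_index2"],
            "pg_table2": [],
        }
    }
}

# Compact separators produce the single-line form used in the config file.
collection_list = json.dumps(collection, separators=(",", ":"))
print(collection_list)
```

Keeping the selection in a structure like this makes it easier to review which tables and indexes are monitored, and to remove ones that no longer provide value.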

Growth drivers

  • Number of brokers in cluster
  • Number of topics in cluster

The Kafka on-host integration configuration provides these adjustable settings that can help manage data volume:

  • interval: Default is 15s
  • TOPIC_MODE: Determines how many topics we collect. Options are all, none, list, or regex.
  • METRICS: Set to true to collect only metrics
  • INVENTORY: Set to true to enable only inventory collection
  • TOPIC_LIST: JSON array of topic names to monitor. Only in effect if TOPIC_MODE is set to list.
  • COLLECT_TOPIC_SIZE: Collect the metric Topic size. Options are true or false, defaults to false.
  • COLLECT_TOPIC_OFFSET: Collect the metric Topic offset. Options are true or false, defaults to false.

Collecting topic-level metrics, especially offsets, can be resource intensive and can have an impact on data volume. A cluster's ingest can increase by an order of magnitude simply through the addition of new Kafka topics to the cluster.

Growth drivers

  • Number of databases monitored

The MongoDB integration provides these adjustable settings that can help manage data volume:

  • interval: Default is 15s
  • METRICS: Set to true to collect only metrics
  • INVENTORY: Set to true to enable only inventory collection
  • FILTERS: A JSON map of database names to an array of collection names. If empty, it defaults to all databases and collections.

For any on-host integration you use, it's important to be aware of parameters like FILTERS where the default is to collect metrics from all databases. This is an area where you can use your monitoring priorities to streamline collected data.

Example configuration with different intervals for METRICS and INVENTORY:

integrations:
  - name: nri-mongodb
    env:
      METRICS: true
      CLUSTER_NAME: my_cluster
      HOST: localhost
      PORT: 27017
      USERNAME: mongodb_user
      PASSWORD: mongodb_password
    interval: 15s
    labels:
      environment: production

  - name: nri-mongodb
    env:
      INVENTORY: true
      CLUSTER_NAME: my_cluster
      HOST: localhost
      PORT: 27017
      USERNAME: mongodb_user
      PASSWORD: mongodb_password
    interval: 60s
    labels:
      environment: production
    inventory_source: config/mongodb

Growth drivers

  • Number of nodes in cluster
  • Number of indices in cluster

The Elasticsearch integration provides these adjustable settings that can help manage data volume:

  • interval: Default is 15s
  • METRICS: Set to true to collect only metrics
  • INVENTORY: Set to true to enable only inventory collection
  • COLLECT_INDICES: Signals whether to collect indices metrics or not.
  • COLLECT_PRIMARIES: Signals whether to collect primary metrics or not.
  • INDICES_REGEX: Filter which indices are collected.
  • MASTER_ONLY: Collect cluster metrics on the elected master only.

Example configuration with different intervals for METRICS and INVENTORY:

integrations:
  - name: nri-elasticsearch
    env:
      METRICS: true
      HOSTNAME: localhost
      PORT: 9200
      USERNAME: elasticsearch_user
      PASSWORD: elasticsearch_password
      REMOTE_MONITORING: true
    interval: 15s
    labels:
      environment: production

  - name: nri-elasticsearch
    env:
      INVENTORY: true
      HOSTNAME: localhost
      PORT: 9200
      USERNAME: elasticsearch_user
      PASSWORD: elasticsearch_password
      CONFIG_PATH: /etc/elasticsearch/elasticsearch.yml
    interval: 60s
    labels:
      environment: production
    inventory_source: config/elasticsearch

Growth drivers

  • Metrics listed in COLLECTION_CONFIG

The JMX integration is inherently generic. It allows you to scrape metrics from any JMX instance. You have control over what gets collected by this integration. In some enterprise New Relic environments, JMX metrics represent a relatively high proportion of all data collected.

The JMX integration provides these adjustable settings that can help manage data volume:

  • interval: Default is 15s
  • METRICS: Set to true to collect only metrics
  • INVENTORY: Set to true to enable only inventory collection
  • METRIC_LIMIT: Number of metrics that can be collected per entity. If this limit is exceeded the entity will not be reported. A limit of 0 implies no limit.
  • LOCAL_ENTITY: Collect all metrics on the local entity. Only used when monitoring localhost.
  • COLLECTION_FILES: A comma-separated list of full file paths to the metric collection definition files. For on-host install, the default JVM metrics collection file is at /etc/newrelic-infra/integrations.d/jvm-metrics.yml.
  • COLLECTION_CONFIG: Configuration of the metrics collection as a JSON.

It's the COLLECTION_CONFIG entries that most govern the amount of data ingested. Understanding the JMX model you are scraping will help you optimize.

COLLECTION_CONFIG example for JVM metrics

COLLECTION_CONFIG='{"collect":[{"domain":"java.lang","event_type":"JVMSample","beans":[{"query":"type=GarbageCollector,name=*","attributes":["CollectionCount","CollectionTime"]},{"query":"type=Memory","attributes":["HeapMemoryUsage.Committed","HeapMemoryUsage.Init","HeapMemoryUsage.Max","HeapMemoryUsage.Used","NonHeapMemoryUsage.Committed","NonHeapMemoryUsage.Init","NonHeapMemoryUsage.Max","NonHeapMemoryUsage.Used"]},{"query":"type=Threading","attributes":["ThreadCount","TotalStartedThreadCount"]},{"query":"type=ClassLoading","attributes":["LoadedClassCount"]},{"query":"type=Compilation","attributes":["TotalCompilationTime"]}]}]}'

Omitting any one entry from that config such as NonHeapMemoryUsage.Init will have a tangible impact on the overall data volume collected.
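
One way to review and trim a collection config is to load it as data, remove the attributes you no longer need, and count what remains. The sketch below uses a shortened, hypothetical subset of the JVM config above; the helper names `count_attributes` and `drop_attributes` are illustrative and not part of the integration.

```python
import copy
import json

# Shortened subset of the JVM example above, for illustration only.
collection_config = {
    "collect": [{
        "domain": "java.lang",
        "event_type": "JVMSample",
        "beans": [
            {"query": "type=Memory",
             "attributes": ["HeapMemoryUsage.Used", "NonHeapMemoryUsage.Init",
                            "NonHeapMemoryUsage.Used"]},
            {"query": "type=Threading",
             "attributes": ["ThreadCount", "TotalStartedThreadCount"]},
        ],
    }]
}

def count_attributes(config):
    """Total number of attributes requested across all beans."""
    return sum(len(bean["attributes"])
               for domain in config["collect"] for bean in domain["beans"])

def drop_attributes(config, unwanted):
    """Return a copy of the config without the unwanted attributes."""
    trimmed = copy.deepcopy(config)
    for domain in trimmed["collect"]:
        for bean in domain["beans"]:
            bean["attributes"] = [a for a in bean["attributes"] if a not in unwanted]
        # Assumption: a bean left with no attributes is no longer wanted.
        domain["beans"] = [b for b in domain["beans"] if b["attributes"]]
    return trimmed

trimmed = drop_attributes(collection_config, {"NonHeapMemoryUsage.Init"})
print(count_attributes(collection_config), "->", count_attributes(trimmed))
print(json.dumps(trimmed, separators=(",", ":")))
```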

COLLECTION_CONFIG example for Tomcat metrics

COLLECTION_CONFIG='{"collect":[{"domain":"Catalina","event_type":"TomcatSample","beans":[{"query":"type=UtilityExecutor","attributes":["completedTaskCount"]}]}]}'

Other on-host integrations

There are many other on-host integrations with configuration options that will help you optimize collection. Some commonly used ones are:

This is a good starting point to learn more.

Network performance monitoring (NPM)

Growth drivers

Monitored devices driven by:

  • hard configured devices
  • CIDR scope in discovery section
  • traps configured

This section focuses on New Relic's network performance monitoring which relies on the ktranslate agent from Kentik. This agent is quite sophisticated and it's important to fully understand the advanced configuration docs before major optimization efforts. Configuration options include:

  • mibs_enabled: Array of all active MIBs the KTranslate docker image will poll. This list is automatically generated during discovery if the discovery_add_mibs attribute is true. MIBs not listed here will not be polled on any device in the configuration file. You can specify a SNMP table directly in a MIB file using MIB-NAME.tableName syntax. Ex: HOST-RESOURCES-MIB.hrProcessorTable.
  • user_tags: key:value pair attributes to give more context to the device. Tags at this level will be applied to all devices in the configuration file.
  • devices: Section listing devices to be monitored for flow
  • traps: configures IP and ports to be monitored with SNMP traps (default is 127.0.0.1:1162)
  • discovery: configures how endpoints can be discovered. Under this section the following parameters will do the most to increase or decrease scope:
    • cidrs: Array of target IP ranges in CIDR notation.
    • ports: Array of target ports to scan during SNMP polling.
    • debug: Indicates whether to enable debug level logging during discovery. By default, it's set to false.
    • default_communities: Array of SNMPv1/v2c community strings to scan during SNMP polling. This array is evaluated in order and discovery accepts the first passing community.

To support filtering of data that does not create value for your observability needs, you can set the global.match_attributes.{} and/or devices.<deviceName>.match_attributes.{} attribute map.

This will provide filtering at the KTranslate level, before shipping data to New Relic, giving you granular control over monitoring of things like interfaces.

For more details, see Network performance monitoring configuration.

Log forwarders

Growth drivers

  • Logs forwarded
  • Average size of forwarded log records

Logs represent one of the most flexible sources of telemetry in that we are typically routing logs through a dedicated forwarding layer with its own routing and transform rules. Because there are a variety of forwarders, we'll focus on the most commonly used ones:

  • APM language agents (recent versions)
  • Fluentd
  • Fluentbit
  • New Relic infrastructure agent (built-in Fluentbit)
  • Logstash

APM agent log sampling

Recent versions of the New Relic language agents can forward logs directly to New Relic. You may want to cap how large a logging spike from each APM agent instance can get.

You can enable sampling with the environment variable NEW_RELIC_APPLICATION_LOGGING_FORWARDING_MAX_SAMPLES_STORED, and configure it by providing the maximum number of logs that the APM agent's logging queue will store. It operates based on a custom priority queue and gives all log messages a priority. Logs that occur within a transaction get the transaction's priority.

The queue for logs is sorted based on the priority and when the log arrives. Higher priority goes first and, if needed, the newest log takes priority. Logs are added individually to the queue (even ones in a transaction), and upon reaching the limit, the log at the end of the queue is pushed out in favor of the newer log.
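
The queueing behavior described above can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not the agent's implementation: the `LogSampler` class, the numeric priorities, and the arrival counter are hypothetical names chosen for illustration. In this model a new log that ranks below everything already stored is itself the entry that gets pushed out.

```python
import heapq
from itertools import count


class LogSampler:
    """Toy model of a bounded log queue ordered by priority, then recency."""

    def __init__(self, max_samples_stored):
        self.max_samples_stored = max_samples_stored
        self._heap = []          # min-heap: the weakest entry sits at index 0
        self._arrival = count()  # increasing counter; larger means newer

    def add(self, priority, message):
        entry = (priority, next(self._arrival), message)
        if len(self._heap) < self.max_samples_stored:
            heapq.heappush(self._heap, entry)
        else:
            # Push the new entry, then evict the weakest one: lowest
            # priority, and among equal priorities the oldest.
            heapq.heappushpop(self._heap, entry)

    def harvest(self):
        """Return kept messages, highest priority first, newest first on ties."""
        return [msg for _, _, msg in sorted(self._heap, reverse=True)]


sampler = LogSampler(max_samples_stored=3)
sampler.add(0.2, "debug: cache miss")
sampler.add(0.9, "error: payment failed")
sampler.add(0.2, "debug: cache hit")
sampler.add(0.5, "warn: slow query")  # queue is full: the oldest 0.2 entry is evicted
print(sampler.harvest())
# ['error: payment failed', 'warn: slow query', 'debug: cache hit']
```

A smaller limit bounds the size of a logging spike more tightly, at the cost of losing more low-priority logs during busy periods.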

In the resources section below, there is a quickstart dashboard that helps you track log volume in a simple way. Tracking log volume lets you adjust the sampling limit, or disable sampling, to suit your observability needs.

Configuring filters in Fluentd or Fluentbit

Most general forwarders provide a fairly complete routing workflow that includes filtering and transformation. Our infrastructure agent provides very simple patterns for filtering unwanted logs.

The pattern parameter takes a regular expression for filtering records. It's only supported for the tail, systemd, syslog, and tcp (only with format none) sources. This field works in a way similar to grep -E in Unix systems. For example, for a given file being captured, you can filter for records containing either WARN or ERROR using:

  - name: only-records-with-warn-and-error
    file: /var/log/logFile.log
    pattern: WARN|ERROR
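
Before deploying a pattern, it can help to try it against a few sample lines. The sketch below uses Python's `re` module as a stand-in for the forwarder's matching. The sample records are invented, and the agent's regex engine may differ from Python's in some details, so treat this as a rough check only.

```python
import re

# Same alternation as the `pattern` value in the config above.
PATTERN = re.compile(r"WARN|ERROR")


def keep_record(line):
    """Mimic grep -E: keep the line if the pattern matches anywhere in it."""
    return PATTERN.search(line) is not None


records = [
    "2024-01-01 12:00:00 INFO  service started",
    "2024-01-01 12:00:05 WARN  disk usage at 85%",
    "2024-01-01 12:00:09 DEBUG cache warmed",
    "2024-01-01 12:00:12 ERROR upstream timeout",
]

forwarded = [r for r in records if keep_record(r)]
for r in forwarded:
    print(r)
```

The match is case-sensitive, so a record containing only `warn` in lowercase would be filtered out by this pattern.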

If you have pre-written Fluent Bit configurations that do valuable filtering or parsing, you can import them into our logging configuration. To do this, use the config_file and parsers_file parameters in any .yaml file in your logging.d folder:

  • config_file: path to an existing Fluent Bit configuration file. Any overlapping source results in duplicate messages in New Relic.
  • parsers_file: path to an existing Fluent Bit parsers file.

The following parser names are reserved: rfc3164, rfc3164-local and rfc5424.

Learning how to inject attributes, or tags, into your logs in your data pipeline and how to perform transformations makes it easier to drop data downstream with New Relic drop rules. By augmenting your logs with metadata about the source, you can make centralized and reversible decisions about what to drop on the backend. At a minimum, make sure the following attributes are present in your logs in some form:

  • Team
  • Environment (dev/stage/prod)
  • Application
  • Data center
  • Log level

Below are some detailed routing and filtering resources:

Adjusting the infrastructure agent's default attribute set

The infrastructure agent adds some attributes by default, including any custom tags added to the host. Your configuration may bring in many more than that, including a large number of AWS tags, which appear in New Relic in the form aws.[attributename]. These attributes are important, so we highly recommend that you evaluate your visualization, analytics, and alerting needs before any planned configuration change. For example, logs from a Kubernetes cluster are unlikely to be useful without metadata such as:

  • cluster_name

  • pod_name

  • container_name

  • node_name

    <Collapser id="prometheus-metrics-sources" title="Prometheus metrics sources"

<Callout variant="IMPORTANT" title="Growth drivers"

  • Number of metrics exported from apps
  • Number of metrics transferred via remote write or POMI

New Relic provides two primary options for sending Prometheus metrics to New Relic. The best practices for managing metric ingest focus primarily on the second option, the Prometheus OpenMetrics integration (POMI), because this component was created by New Relic.

For Prometheus server scrape configuration options, see the Prometheus config docs. These scrape configurations determine which metrics are collected by the Prometheus server. By configuring the remote_write parameter, you can write the collected metrics to the New Relic database (NRDB) via the New Relic Metric API.

POMI is a standalone integration that scrapes metrics from both dynamically discovered and static Prometheus endpoints. POMI then sends this data to NRDB via the New Relic Metric API. This integration is ideal for customers not currently running Prometheus server.

POMI: scrape label

POMI will discover any Prometheus endpoint containing the label or annotation prometheus.io/scrape=true by default. This can be a large number of endpoints and thus, a large number of metrics ingested, depending on what's deployed in the cluster.

It's suggested that you modify the scrape_enabled_label parameter to something custom (e.g. newrelic/scrape), and that you selectively label the Prometheus endpoints when data ingest is of utmost concern.

For the latest reference config, see nri-prometheus-latest.yaml.

POMI config parameter:

# Label used to identify scrapable targets. 
# Defaults to "prometheus.io/scrape"
  scrape_enabled_label: "prometheus.io/scrape"

POMI: scrape label for nodes

POMI will discover any Prometheus endpoint exposed at the node-level by default. This typically includes metrics coming from Kubelet and cAdvisor. If you're running the New Relic Kubernetes Daemonset, it's important that you set require_scrape_enabled_label_for_nodes: true so that POMI doesn't collect duplicate metrics.

For the endpoints targeted by the New Relic Kubernetes Daemonset, see our Kubernetes README on GitHub.

POMI config parameters:

# Whether k8s nodes need to be labeled to be scraped or not. 
# Defaults to false.
  require_scrape_enabled_label_for_nodes: false

POMI: co-existing with nri-kubernetes

New Relic's Kubernetes integration collects a number of metrics out of the box. However, it doesn't collect every possible metric available from a Kubernetes cluster.

In the POMI config, you'll see a section similar to the one below. It disables collection for a subset of metrics that the New Relic Kubernetes integration already collects from Kube State Metrics.

It's also very important to set require_scrape_enabled_label_for_nodes: true so that Kubelet and cAdvisor metrics aren't duplicated.

POMI config parameters:

  transformations:
    - description: "Uncomment if running New Relic Kubernetes integration"
      ignore_metrics:
        - prefixes:
          - kube_daemonset_
          - kube_deployment_
          - kube_endpoint_
          - kube_namespace_
          - kube_node_
          - kube_persistentvolume_
          - kube_persistentvolumeclaim_
          - kube_pod_
          - kube_replicaset_
          - kube_service_
          - kube_statefulset_
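
To see how a prefix list like this narrows what gets forwarded, here's a small Python sketch. It only mimics prefix matching and isn't POMI's code. The sample metric names in `scraped` are illustrative.

```python
# Prefixes copied from the ignore_metrics block above.
IGNORED_PREFIXES = (
    "kube_daemonset_", "kube_deployment_", "kube_endpoint_", "kube_namespace_",
    "kube_node_", "kube_persistentvolume_", "kube_persistentvolumeclaim_",
    "kube_pod_", "kube_replicaset_", "kube_service_", "kube_statefulset_",
)


def is_ignored(metric_name):
    # str.startswith accepts a tuple and returns True if any prefix matches.
    return metric_name.startswith(IGNORED_PREFIXES)


scraped = [
    "kube_pod_status_phase",
    "kube_job_status_failed",
    "http_requests_total",
    "kube_node_info",
]
forwarded = [m for m in scraped if not is_ignored(m)]
print(forwarded)  # ['kube_job_status_failed', 'http_requests_total']
```

Anything not covered by a prefix, such as `kube_job_` metrics in this sample, still counts toward ingest.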

POMI: request/limit settings

When running POMI, it's recommended to apply the following resource limits for clusters generating approximately 500k DPM:

  • CPU limit: 1 core (1000m)
  • Memory limit: 1 GB (1G)

You should set the resource request for CPU and memory to provide POMI with enough resources from the cluster. Setting this to something extremely low (e.g. cpu: 50m) can result in cluster resources being consumed by "noisy neighbors."

POMI config parameter:

spec:
  serviceAccountName: nri-prometheus
  containers:
  - name: nri-prometheus
    image: newrelic/nri-prometheus:2.2.0
    resources:
      requests:
        memory: 512Mi
        cpu: 500m
      limits:
        memory: 1G
        cpu: 1000m

POMI: estimating DPM and cardinality

Although cardinality isn't directly associated with billable per-GB ingest, New Relic does maintain certain rate limits on cardinality and data points per minute. Being able to visualize cardinality and DPM from a Prometheus cluster can be very important.

New Relic accounts have a 1M DPM and 1M cardinality limit, but you can request up to 15M DPM and 15M cardinality. To request changes, contact your New Relic account representative. For more information, see [Metric API limits](/docs/data-apis/ingest-apis/metric-api/metric-api-limits-restricted-attributes).

If you're already running Prometheus Server, you can run DPM and cardinality estimates prior to enabling POMI or remote_write.

Data points per minute (DPM):

rate(prometheus_tsdb_head_samples_appended_total[10m]) * 60

Top 20 metrics (highest cardinality):

topk(20, count by (__name__, job)({__name__=~".+"}))
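
The DPM query multiplies a per-second rate by 60. The arithmetic can be sketched in Python as follows. This is a simplification: Prometheus's `rate()` also handles counter resets and extrapolation, which this sketch ignores. The counter readings are made-up numbers, and `DEFAULT_DPM_LIMIT` restates the 1M default mentioned above.

```python
DEFAULT_DPM_LIMIT = 1_000_000  # default account limit cited in this section


def data_points_per_minute(samples_start, samples_end, window_seconds):
    """Approximate rate(counter[window]) * 60 from two counter readings."""
    appended = samples_end - samples_start
    # Multiply before dividing to keep the arithmetic exact for round numbers.
    return appended * 60 / window_seconds


# Hypothetical readings of prometheus_tsdb_head_samples_appended_total
# taken 10 minutes (600 seconds) apart.
dpm = data_points_per_minute(1_200_000_000, 1_205_000_000, 600)
headroom = DEFAULT_DPM_LIMIT - dpm

print(f"estimated DPM: {dpm:,.0f}")        # estimated DPM: 500,000
print(f"headroom vs limit: {headroom:,.0f}")  # headroom vs limit: 500,000
```

If the estimate is close to or above the limit, that's a signal to narrow what you scrape or to request a higher limit before turning on POMI or remote_write.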

<Collapser id="cloud-integration" title="Cloud integrations"

<Callout variant="IMPORTANT" title="Growth drivers"

  • Number of metrics exported per integration
  • Polling frequency (for polling-based integrations)

Some New Relic cloud integrations get data from cloud providers' APIs. With this implementation, data is collected from monitoring APIs such as AWS CloudWatch, Azure Monitor, and GCP Stackdriver, and inventory metadata is collected from the specific services' APIs.

Other cloud integrations get their data from streaming metrics (or "pushed" metrics) that are pushed via a streaming service such as AWS Kinesis.

Polling API-based integrations

If you want to report more or less data from your cloud integrations, or if you need to control the use of the cloud providers' APIs to avoid hitting rate and throttling limits in your cloud account, you can change the configuration settings to modify the amount of data they report. The two main controls are the polling frequency and which data is reported.

Examples of business reasons for wanting to change your polling frequency include:

  • Billing: If you need to manage your AWS CloudWatch bill, you may want to decrease the polling frequency. Before you do this, make sure that any alert conditions set for your cloud integrations aren't affected by this reduction.
  • New services: If you're deploying a new service or configuration and you want to collect data more often, you may want to increase the polling frequency temporarily.

Changing the configuration settings for your integrations may impact alert conditions and chart trends.

For more details, see Configure polling.

"Streaming" or "pushed" metrics

More and more cloud integrations are offering the option of having data pushed via a streaming service instead of using API polling, which cuts down on latency drastically. One issue some users have observed is that it's not as easy to control volume because you can't configure the sampling rate.

New Relic drop rules are the primary way of filtering out streaming metrics whose volume is too high. However, there are some things that you can do on the cloud provider side to help limit the stream volume.

For example, in AWS it's possible to use condition keys to limit access to CloudWatch namespaces.

The following policy limits the user to publishing metrics only in the namespace named MyCustomNamespace:

{
    "Version": "2012-10-17",
    "Statement": {
        "Effect": "Allow",
        "Resource": "*",
        "Action": "cloudwatch:PutMetricData",
        "Condition": {
            "StringEquals": {
                "cloudwatch:namespace": "MyCustomNamespace"
            }
        }
    }
}

The following policy allows the user to publish metrics in any namespace except for CustomNamespace2:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Resource": "*",
            "Action": "cloudwatch:PutMetricData"
        },
        {
            "Effect": "Deny",
            "Resource": "*",
            "Action": "cloudwatch:PutMetricData",
            "Condition": {
                "StringEquals": {
                    "cloudwatch:namespace": "CustomNamespace2"
                }
            }
        }
    ]
}

Optimize with drop rules [#optimize-with-drop-rules]

A simple rule for understanding what you can do with drop rules is: If you can query it, you can drop it. Drop filter rules help you accomplish several important goals:

  • Lower costs by storing only the logs relevant to your account.
  • Protect privacy and security by removing personally identifiable information (PII).
  • Reduce noise by removing irrelevant events and attributes.

When creating drop rules, you're responsible for ensuring that the rules accurately identify and discard the data that meets the conditions that you've established. You're also responsible for monitoring the rule, as well as the data you disclose to New Relic. Always test and retest your queries and, after you install the drop rule, make sure it works as intended. Creating a dashboard to monitor your data pre- and post-drop will help.

Here's some guidance for using drop rules to optimize data ingest for specific tools:

All New Relic drop rules are implemented by the same backend data model and API. Our log management provides a powerful UI that makes it very easy to create and monitor drop rules.

Previously in this tutorial series, we covered prioritizing telemetry by running through some exercises to show ways in which we could deprecate certain data. Let's revisit this example:

Omit debug logs (knowing they can be turned on if there is an issue) (saves 5%)

Method 1: Log UI

  • Identify the logs we care about using a filter in the Log UI: level: DEBUG.
  • Make sure it finds the logs we want to drop.
  • Check some alternative syntax such as level:debug and log_level:Debug. These variations are common.
  • Under Manage data, click Drop filters, and create and enable a filter named 'Drop debug logs'.
  • Verify the rule works.

Method 2: NerdGraph API

  • Create the relevant NRQL query:
    SELECT count(*) FROM Log WHERE `level` = 'DEBUG'
  • Make sure it finds the logs you want to drop.
  • Check variations on the attribute name and value (Debug vs DEBUG).
  • Execute the following NerdGraph statement and make sure it works:
mutation {
    nrqlDropRulesCreate(accountId: YOUR_ACCOUNT_ID, rules: [
        {
            action: DROP_DATA
            nrql: "SELECT * FROM Log WHERE `level` = 'DEBUG'"
            description: "Drops DEBUG logs.  Disable if needed for troubleshooting."
        }
    ])
    {
        successes { id }
        failures {
            submitted { nrql }
            error { reason description }
        }
    }
}
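
If you'd rather script this than paste it into a GraphQL explorer, the mutation can be assembled in Python. The sketch below is a minimal example under stated assumptions: `build_drop_rule_request` and `send_to_nerdgraph` are hypothetical helper names, the account ID is a placeholder, and the endpoint shown is the US NerdGraph URL. The send function isn't called here, and you should confirm the endpoint and authentication header for your account before using it.

```python
import json
import urllib.request

NERDGRAPH_URL = "https://api.newrelic.com/graphql"  # assumed US endpoint

MUTATION_TEMPLATE = """
mutation {
  nrqlDropRulesCreate(accountId: %(account_id)d, rules: [
    { action: %(action)s, nrql: %(nrql)s, description: %(description)s }
  ]) {
    successes { id }
    failures { submitted { nrql } error { reason description } }
  }
}
"""


def build_drop_rule_request(account_id, nrql, description, action="DROP_DATA"):
    """Return the JSON body for a single-rule nrqlDropRulesCreate mutation."""
    query = MUTATION_TEMPLATE % {
        "account_id": account_id,
        "action": action,
        # json.dumps adds the surrounding double quotes and escapes the text.
        "nrql": json.dumps(nrql),
        "description": json.dumps(description),
    }
    return {"query": query}


def send_to_nerdgraph(payload, api_key):
    """POST the payload to NerdGraph and return the decoded JSON response."""
    request = urllib.request.Request(
        NERDGRAPH_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "API-Key": api_key},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


payload = build_drop_rule_request(
    1234567,  # placeholder account ID
    "SELECT * FROM Log WHERE `level` = 'DEBUG'",
    "Drops DEBUG logs.  Disable if needed for troubleshooting.",
)
print(payload["query"])
```

Keeping the builder separate from the sender makes it easy to review the exact mutation text before anything is dropped.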

<Collapser id="process-samples" title="Process samples"

Let's implement the recommendation: Drop process sample data in DEV environments.

  • Create the relevant query:

    SELECT * FROM ProcessSample WHERE `env` = 'DEV'
  • Make sure it finds the process samples we want to drop.

  • Check for other variations on env such as ENV and Environment.

  • Check for variations of DEV such as Dev and Development.

  • Use our NerdGraph API to execute the following statement and make sure it works:

    mutation {
        nrqlDropRulesCreate(accountId: YOUR_ACCOUNT_ID, rules: [
            {
                action: DROP_DATA
                nrql: "SELECT * FROM ProcessSample WHERE `env` = 'DEV'"
                description: "Drops ProcessSample from development environments"
            }
        ])
        {
            successes { id }
            failures {
                submitted { nrql }
                error { reason description }
            }
        }
    }

<Collapser id="cloud-metrics" title="Cloud metrics"

You can often reduce your data usage by cutting down on data with redundant coverage. For example: in an environment where you have the AWS RDS integration running as well as one of the New Relic on-host integrations that monitor SQL databases such as nri-mysql or nri-postgresql, you may be able to discard some overlapping metrics.

For example, you can run a query like this:

FROM Metric select count(*) where metricName like 'aws.rds%' facet metricName limit max

That will show all metricName values matching the pattern.

You can see from the results there's a high volume of metrics of the pattern aws.rds.cpu%. You can drop those because you have other instrumentation for those:

  • Create the relevant query:
    FROM Metric SELECT count(*) WHERE metricName LIKE 'aws.rds.cpu%' FACET metricName LIMIT MAX SINCE 1 day ago
  • Make sure it finds the metrics you want to drop.
  • Use the NerdGraph API to execute the following statement and make sure it works:
mutation {
    nrqlDropRulesCreate(accountId: YOUR_ACCOUNT_ID, rules: [
        {
            action: DROP_DATA
            nrql: "SELECT * FROM Metric WHERE metricName LIKE 'aws.rds.cpu%'"
            description: "Drops rds cpu related metrics"
        }
    ])
    {
        successes { id }
        failures {
            submitted { nrql }
            error { reason description }
        }
    }
}

<Collapser id="drop-specific-attributes" title="Drop specific attributes"

One important thing about drop rules is that you can configure a rule that drops specific attributes but maintains the integrity of the rest of the data. Use this to remove private data from NRDB, or to drop oversized attributes, such as stack traces or large chunks of JSON in very large log records.

To set these drop rules, change the action field to DROP_ATTRIBUTES instead of DROP_DATA.

mutation {
    nrqlDropRulesCreate(accountId: YOUR_ACCOUNT_ID, rules: [
        {
            action: DROP_ATTRIBUTES
            nrql: "SELECT stack_trace, json_data FROM Log where appName='myApp'"
            description: "Drops large fields from logs for myApp"
        }
    ])
    {
        successes { id }
        failures {
            submitted { nrql }
            error { reason description }
        }
    }
}

<Collapser id="drop-random-sample-of-events" title="Drop random sample of events"

Use this approach carefully, and only in situations where there are no other options, because it can alter the conclusions you draw from your data. However, for events with a massive sample size, you may be satisfied with only a portion of your data as long as you understand the consequences.

In this example, you can take advantage of the relative distribution of certain trace IDs to approximate random sampling. You can use the rlike operator to check the trailing characters of a span's trace.id attribute. The following example could drop about 25% of spans:

SELECT * FROM Span WHERE trace.id rlike r'.*[0-3]' and appName = 'myApp'

Useful expressions include:

  • r'.*0' approximates 6.25%
  • r'.*[0-1]' approximates 12.5%
  • r'.*[0-2]' approximates 18.75%
  • r'.*[0-3]' approximates 25.0%

After running out of digits, you can use letters, for example:

  • r'.*[a0-9]' approximates 68.75%
  • r'.*[a-b0-9]' approximates 75.0%

Here's an example of a full mutation:

mutation {
    nrqlDropRulesCreate(accountId: YOUR_ACCOUNT_ID, rules: [
        {
            action: DROP_DATA
            nrql: "SELECT * FROM Span WHERE trace.id rlike r'.*[0-3]' and appName = 'myApp'"
            description: "Drops approximately 25% of spans for myApp"
        }
    ])
    {
        successes { id }
        failures {
            submitted { nrql }
            error { reason description }
        }
    }
}

Since `trace.id` values are hexadecimal numbers, every character of a `trace.id` is one of `0123456789abcdef`. Each character that you add to the `RLIKE` pattern will match an additional 1/16 of the Span events, assuming the final characters are evenly distributed. If you add letters beyond `f` that aren't used in hexadecimal, the added characters won't affect the percentage matched.
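
The percentages above can be checked with a short Python sketch. It uses Python's `re.fullmatch` as a stand-in for `RLIKE`, assuming `RLIKE` must match the whole value (which the leading `.*` suggests), and it generates random 128-bit hexadecimal IDs as hypothetical trace IDs. Real trace IDs may not be perfectly uniform.

```python
import random
import re

HEX_DIGITS = "0123456789abcdef"


def expected_fraction(char_class):
    """Fraction of hex digits matched by a character class such as '[0-3]'."""
    matcher = re.compile(char_class)
    matched = sum(1 for digit in HEX_DIGITS if matcher.fullmatch(digit))
    return matched / len(HEX_DIGITS)


def observed_fraction(char_class, n=100_000, seed=42):
    """Share of n random 32-character hex IDs whose last character matches."""
    rng = random.Random(seed)
    matcher = re.compile(".*" + char_class)
    hits = 0
    for _ in range(n):
        trace_id = "%032x" % rng.getrandbits(128)
        if matcher.fullmatch(trace_id):
            hits += 1
    return hits / n


for char_class in ("[0-3]", "[a0-9]", "[a-b0-9]", "[a-z0-9]"):
    print(char_class, expected_fraction(char_class))
# [0-3] 0.25
# [a0-9] 0.6875
# [a-b0-9] 0.75
# [a-z0-9] 1.0

print(round(observed_fraction("[0-3]"), 2))
```

The last class shows the point about letters beyond `f`: `[a-z0-9]` can't match more than the 16 hexadecimal digits, so it matches every ID.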

<Collapser id="other-events-and-metrics" title="Other events and metrics"

The preceding examples should show you all you need to know to use these techniques on any other event or metric in NRDB. Remember: if you can query it, you can drop it. Reach out to us if you have questions about the precise way to structure a query for a drop rule.

What's next? [#whats-next]

With the optimization step complete, you've finished the optimizing telemetry data tutorial! If your account has an account executive, you can contact them for further guidance on where to go next and ensuring you're optimized.

If you're new to the New Relic platform, you can visit our other tutorial series to learn more about optimizing your system using our platform: