Commit

Merge pull request #15027 from thezackm/feat/snmp-calculations
feat/snmp calculations
ally-sassman committed Oct 31, 2023
2 parents d30e272 + 7045826 commit 5846c89
Showing 13 changed files with 160 additions and 13 deletions.
Original file line number Diff line number Diff line change
Expand Up @@ -11,7 +11,7 @@ metaDescription: How to run a specific version of the container.

The default Docker commands given by the guided installation will update you to the latest release every time you start. There are several scenarios where you might want to run an older release or pin your environment to a specific version.

## Solutions [#solutions]
## Solution [#solution]

You can find older releases of the container on [Docker Hub](https://hub.docker.com/r/kentik/ktranslate/tags).

Expand Up @@ -17,7 +17,7 @@ You've run `docker run`, but nothing seems to be happening and you see, in the d

This happens when the `snmp-base.yaml` file has an ownership permission that prevents the docker user from editing the file, most often because you created the file as the `root` user or a similar privileged account. The docker container runs with a non-privileged user that can't modify this file. Inside the container, `ktranslate` is always trying to use user ID 1000 and group ID 1000, so the ownership needs to allow for those IDs to own the file.

## Solutions [#solutions]
## Solution [#solution]

From your privileged account, change the ownership of the file before you pass it into the docker container. For example, run:

Expand Up @@ -17,7 +17,7 @@ When a discovery job runs, each time it discovers an IP it tries to determine th

When `ktranslate` polls, it uses only the IP address. If your devices list contains multiple entries with the same IP address, it collects and sends metrics to New Relic as if they were separate entities, but in reality the data is identical: it comes from whichever device responded to requests to that IP address during the current polling interval. The `device_name` is not collected or updated as part of the polling cycle.

## Solutions [#solutions]
## Solution [#solution]

If the `device_name` has changed due to a one-time change, like replacing a piece of hardware or updating your naming conventions, then you should edit the `snmp-base.yaml` file and delete the entry with the old device name. The old entity will still show in the **Explorer** menu, but it will stop receiving new data and will eventually age out of the system. In most cases that happens 8 days after data stops coming in.

Expand Up @@ -15,7 +15,7 @@ You've ktranslate containers hitting 100% CPU utilization, or just generally it'
One detail to be careful about is that for ktranslate it's important to focus on the maximum CPU percent instead of the average. Ktranslate uses a high percentage of CPU near the beginning of a polling cycle and much less at the end of the cycle. When you look at the average usage you might see 60% and miss the fact that ktranslate is spending time close to 100%, which is a potential problem. You need to allocate enough resources so that the max CPU consumption doesn't hit 100%.
</Callout>

## Solutions [#solutions]
## Solution [#solution]

The causes of CPU usage vary depending on what type of ktranslate container you're working with.

Expand Up @@ -11,7 +11,7 @@ metaDescription: Meraki API polling is working, but expected metrics are missing

During Meraki API monitoring, you don't see all of the expected metrics for your controller.

## Solutions [#solutions]
## Solution [#solution]

Identify what metrics exist in New Relic by running the following NRQL query:

Expand Up @@ -13,7 +13,7 @@ You have a new type of device that has a profile but no entity has been added in

This happens when a device has an [SNMP profile](https://github.com/kentik/snmp-profiles) telling `ktranslate` what metrics to collect, but a new [entity definition](https://github.com/newrelic/entity-definitions) is still in progress for how to display that collection of metrics in New Relic.

## Solutions [#solutions]
## Solution [#solution]

When creating a new entity type, we must review the data that comes in from the profile. That data is used to create a definition that includes information such as the golden metrics for this entity type, and used to create a dashboard. This can take some time and sometimes requires talking to the users who submitted the profile request to ensure the entity definition suits their needs.

Expand Up @@ -15,7 +15,7 @@ This usually happens when a device can't reliably respond to the SNMP requests w

Another scenario might be that the device is overloaded and can't respond to the SNMP requests quickly. This usually happens when you try to collect OIDs from very large tables with a `poll_time_sec` that is too short for the device to keep up with.

## Solutions [#solutions]
## Solution [#solution]

As a general rule, locate your polling container as close to the monitored devices as you can to reduce the chance of the UDP SNMP payloads not making the trip.

Expand Up @@ -11,7 +11,7 @@ metaDescription: SNMP monitoring discovery results in 'Kentik Default' entities

After running a discovery, you are seeing `Kentik Default` entities in the New Relic UI.

## Solutions [#solutions]
## Solution [#solution]

During discovery, `ktranslate` collects the [System Object Identifier](http://oid-info.com/get/1.3.6.1.2.1.1.2), also called sysObjectID or sysOID, which provides an easy way to identify a device. Every device type that responds to SNMP has a sysObjectID, and the value of that ID should be unique enough that anyone can identify which type of device it is.

Expand Up @@ -11,7 +11,7 @@ metaDescription: SNMP monitoring discovery does not find any devices, or you did

You launched an SNMP discovery run but didn't find all of the expected devices.

## Solutions [#solutions]
## Solution [#solution]

The SNMP discovery process will run against every IP address in your list from the [`cidrs`](/docs/network-performance-monitoring/advanced/advanced-config/#discovery) section in the discovery configuration. During the scan, there's a TCP port check to ensure the target IP address is responsive. If successful, `ktranslate` will then attempt to communicate with the IP address via SNMP.

Expand Up @@ -23,7 +23,7 @@ There are two distinct scenarios that can exist after this process:
1. Device is matched to an expected profile and collects metrics without issue.
2. Device is unexpectedly matched to the wrong profile and is collecting the wrong metrics or is missing metrics.

## Solutions [#solutions]
## Solution [#solution]

### Kentik Default devices [#kentik-default]

Expand Up @@ -11,7 +11,7 @@ metaDescription: Gathering details on all supported OIDs for your device using t

You are having trouble collecting SNMP metrics from your device or you need to see what specific Object Identifiers (OIDs) your device supports.

## Solutions [#solutions]
## Solution [#solution]

The [snmpwalk](https://helpmanual.io/help/snmpwalk/) utility is a useful tool for troubleshooting various SNMP challenges you may encounter. Because `ktranslate` runs on the host network of the Linux host that runs Docker, running `snmpwalk` from that host accurately shows whether your devices are responding to SNMP requests and what specifically they are responding with.

@@ -0,0 +1,145 @@
---
title: Understanding default SNMP utilization calculations
tags:
- Integrations
- Network monitoring
- Troubleshooting
metaDescription: Understanding how various utilization metrics are calculated in ktranslate.
---

## Problem [#problem]

You have questions about various results calculated by the `ktranslate` network monitoring agent.

## Background [#background]

`ktranslate` returns the raw data collected by SNMP polling in almost every instance, with the following exceptions:
* CPU utilization %
* Memory utilization %
* Interface utilization %
* Interface error %
* Various metrics with the `enum` or `conversion` functions applied in their configuration

## Solution [#solution]

<CollapserGroup>
<Collapser
id="cpu-utilization"
title="CPU Utilization %"
>
**Metric Name**: `kentik.snmp.CPU`

CPU is generally returned in a direct OID that provides an integer or float value representing percentage utilization. In rare cases, there is only a result for CPU idle, which is [translated to CPU](https://github.com/kentik/ktranslate/blob/72257357db05f36e05389b0a278b702a707a0941/pkg/inputs/snmp/metrics/device_metrics.go#L281-L285) using this formula:

```
CPU = 100 - CPU Idle
```
</Collapser>
<Collapser
id="memory-utilization"
title="Memory Utilization %"
>
**Metric Name**: `kentik.snmp.MemoryUtilization`

Unlike CPU, memory utilization is rarely presented as a direct OID value. To calculate the percent utilization, [ktranslate uses this logic](https://github.com/kentik/ktranslate/blob/72257357db05f36e05389b0a278b702a707a0941/pkg/inputs/snmp/metrics/device_metrics.go#L287-L317):

```
If Memory Used and Memory Free are available:
Memory Utilization = ( Memory Used / (Memory Free + Memory Used) ) * 100
If Memory Total and Memory Free are available:
Memory Utilization = ( (Memory Total - Memory Free) / Memory Total ) * 100
If Memory Total and Memory Used are available:
Memory Utilization = ( Memory Used / Memory Total ) * 100
If Memory Total, Memory Buffer, and Memory Cache are available:
Memory Utilization = ( ( Memory Total - (Memory Buffer + Memory Cache ) ) / Memory Total ) * 100
```
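
The cases above can be sketched in Python as follows. This is an illustration, not the `ktranslate` source: the function name `memory_utilization`, its parameter names, the evaluation order (taken from the order listed above), and the divide-by-zero guard are all assumptions made for this example.

```python
from typing import Optional


def memory_utilization(
    total: Optional[float] = None,
    used: Optional[float] = None,
    free: Optional[float] = None,
    buffer: Optional[float] = None,
    cache: Optional[float] = None,
) -> Optional[float]:
    """Return memory utilization as a percentage, or None if it can't be derived.

    Hypothetical helper: the cases are checked in the order listed in the text.
    """
    if used is not None and free is not None:
        denominator = used + free
        return used / denominator * 100 if denominator else None
    if total is not None and free is not None:
        return (total - free) / total * 100 if total else None
    if total is not None and used is not None:
        return used / total * 100 if total else None
    if total is not None and buffer is not None and cache is not None:
        return (total - (buffer + cache)) / total * 100 if total else None
    # Not enough values were polled to derive a percentage.
    return None


print(memory_utilization(used=2, free=6))    # used and free are available
print(memory_utilization(total=8, free=6))   # total and free are available
```

Both calls describe the same device state (2 units used out of 8), so they produce the same percentage.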
</Collapser>
<Collapser
id="interface-utilization"
title="Interface Utilization %"
>
**Metric Name**: `kentik.snmp.IfInUtilization` | `kentik.snmp.IfOutUtilization`

Interface utilization follows the industry standard approach of calculating the delta in bytes and dividing by the product of the interface's configured speed and the time delta since the last collection was made.

For example, assuming 1 is the previous data point and 2 is the most recent:

> ( ( ifInOctets_2 - ifInOctets_1 ) \* 8 \* 100 ) / ( (sysUptime_2 - sysUptime_1) \* ifSpeed )

[Ktranslate](https://github.com/kentik/ktranslate/blob/72257357db05f36e05389b0a278b702a707a0941/pkg/formats/nrm/nrm.go#L581-L623) uses the value of either [ifHCInOctets](https://oid-rep.orange-labs.fr/get/1.3.6.1.2.1.31.1.1.1.6) (receive) or [ifHCOutOctets](https://oid-rep.orange-labs.fr/get/1.3.6.1.2.1.31.1.1.1.10) (transmit), replacing `ifSpeed` in the denominator with the value of [ifHighSpeed](https://oid-rep.orange-labs.fr/get/1.3.6.1.2.1.31.1.1.1.15) as needed:

```
( inBytes * 8 * 100 ) / ( uptime * ( ifSpeed * 10000 ) )
or
( outBytes * 8 * 100 ) / ( uptime * ( ifSpeed * 10000 ) )
```

<Callout variant="tip">
A common reason for seeing inaccurate interface utilization percentages is that the configured interface speed on the device doesn't reflect the real interface speed. For instance, a 1 Gbps MPLS circuit on a 10 Gbps interface would show percentages at only 10% of the real utilization. To resolve this, consult your vendor's documentation on setting the interface bandwidth.
</Callout>
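
As a worked example of the formula above, here is a minimal Python sketch. The function name and arguments are hypothetical, and it assumes the factor of `10000` combines two unit conversions: `ifHighSpeed` is reported in Mbps (multiply by 1,000,000 to get bits per second), and the uptime delta is in hundredths of a second, as SNMP TimeTicks are (divide by 100 to get seconds).

```python
def interface_utilization(octets_delta: int, uptime_delta_cs: int, high_speed_mbps: int) -> float:
    """Return interface utilization as a percentage.

    Hypothetical helper mirroring the formula in the text:
    octets_delta    -- bytes counted since the previous poll
    uptime_delta_cs -- time since the previous poll, in hundredths of a second (assumed)
    high_speed_mbps -- interface speed in Mbps, as reported by ifHighSpeed (assumed)
    """
    if uptime_delta_cs <= 0 or high_speed_mbps <= 0:
        raise ValueError("uptime delta and interface speed must be positive")
    return (octets_delta * 8 * 100) / (uptime_delta_cs * (high_speed_mbps * 10_000))


# 750,000,000 bytes received in 60 seconds (6000 hundredths of a second).
print(interface_utilization(750_000_000, 6000, 1_000))   # interface configured as 1 Gbps
print(interface_utilization(750_000_000, 6000, 10_000))  # same traffic, configured as 10 Gbps
```

The second call reproduces the situation described in the tip: the same traffic reported against a 10 Gbps configured speed yields one tenth of the percentage calculated against 1 Gbps.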
</Collapser>
<Collapser
id="interface-errors"
title="Interface Errors %"
>
**Metric Name**: `kentik.snmp.ifInErrorPercent` | `kentik.snmp.ifOutErrorPercent`

Interface error percentage uses the value of either [ifInErrors](https://oid-rep.orange-labs.fr/get/1.3.6.1.2.1.2.2.1.14) (receive) or [ifOutErrors](https://oid-rep.orange-labs.fr/get/1.3.6.1.2.1.2.2.1.20) (transmit), divided by either [ifHCInUcastPkts](https://oid-rep.orange-labs.fr/get/1.3.6.1.2.1.31.1.1.1.7) (receive) or [ifHCOutUcastPkts](https://oid-rep.orange-labs.fr/get/1.3.6.1.2.1.31.1.1.1.11) (transmit). [In ktranslate](https://github.com/kentik/ktranslate/blob/72257357db05f36e05389b0a278b702a707a0941/pkg/inputs/snmp/metrics/interface_metrics.go#L255-L271), the formula looks like this:

```
( ifInErrors / ifHCInUcastPkts ) * 100
or
( ifOutErrors / ifHCOutUcastPkts ) * 100
```
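
A minimal Python sketch of the same calculation follows. The function name is hypothetical, and returning `0.0` when no unicast packets were counted is an assumption made here to avoid dividing by zero; it is not taken from the `ktranslate` source.

```python
def interface_error_percent(errors: int, ucast_pkts: int) -> float:
    """Return errors as a percentage of unicast packets.

    Hypothetical helper: pass ifInErrors with ifHCInUcastPkts (receive),
    or ifOutErrors with ifHCOutUcastPkts (transmit).
    """
    if ucast_pkts == 0:
        # Assumption for this sketch: no packets means no meaningful error rate.
        return 0.0
    return errors * 100 / ucast_pkts


print(interface_error_percent(5, 1_000))  # 5 errors across 1,000 unicast packets
```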
</Collapser>
<Collapser
id="snmp-conversions"
title="SNMP conversions"
>
**Metric Name**: Various

Other SNMP metrics are converted based on the existence of the `enum` and `conversion` functions in their respective [SNMP profile](https://github.com/kentik/snmp-profiles/blob/main/profiles/kentik_snmp/_template.yml).

<table>
<thead>
<tr>
<th style={{ width: "450px" }}> Profile Setting </th>
<th> Usage </th>
</tr>
</thead>
<tbody>
<tr>
<td> `enum:[]` </td>
<td> Used to handle SNMP enumerations which convert the integer value of a dimensional metric into the enumerated value in an attribute decorated on the dimensional metric (using the same metric name suffix). A common example is the conversion of [kentik.snmp.if_AdminStatus](https://oid-rep.orange-labs.fr/get/1.3.6.1.2.1.2.2.1.7) to the enumerated value of [if_AdminStatus](https://github.com/kentik/snmp-profiles/blob/ccb1df47a5068a59fb3e3765746524e0286252e7/profiles/kentik_snmp/_general/if-mib.yml#L59-L66) as either `up`, `down`, or `testing`. </td>
</tr>
<tr>
<td> `conversion: hextoint: <current>: <desired>` </td>
<td> Used to convert hexadecimal values into integer format. Options for **current**: `LittleEndian` | `BigEndian`. Options for **desired**: `uint16` | `uint32` | `uint64` </td>
</tr>
<tr>
<td> `conversion: hextoip` </td>
<td> Used to convert hexadecimal values into 4-octet IPv4 strings. </td>
</tr>
<tr>
<td> `conversion: hwaddr` </td>
<td> Used to convert hexadecimal values into MAC address strings. </td>
</tr>
<tr>
<td> `conversion: powerset_status` </td>
<td> Used for enumeration of the [upsBasicStateOutputState](https://oid-rep.orange-labs.fr/get/1.3.6.1.4.1.318.1.1.1.11.1.1) ASCII string in the `POWERNET-MIB`. </td>
</tr>
<tr>
<td> `conversion: regexp` </td>
<td> Places a regex match on the OID output to capture substrings; needs to be wrapped in quotes and have backslashes escaped.<br />Example OID result: `" 5 Secs ( 96.3762%) 60 Secs ( 62.8549%) 300 Secs ( 25.2877%)"`<br />Example conversion: `"regexp:60 Secs.*?(\\d+)"`<br />Final result: `62` </td>
</tr>
<tr>
<td> `conversion: to_one` </td>
<td> Used to create a gauge metric with the value of `1` in order to poll non-numeric scalar OIDs that don't have enumeration options. An example is the [tlUpsTestResultsDetail](https://oid-rep.orange-labs.fr/get/1.3.6.1.4.1.850.100.1.7.2) OID which returns a value of the type [DisplayString](https://www.circitor.fr/Mibs/Html/S/SNMPv2-TC.php#DisplayString). </td>
</tr>
</tbody>
</table>
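
To illustrate what several of these conversions produce, here is a minimal Python sketch. It is not the `ktranslate` implementation: the helper names are hypothetical, and it assumes the raw OID value is available as a plain hexadecimal string. The `ifAdminStatus` mapping uses the values defined in the IF-MIB, and the regular expression is the one from the table, written without the extra backslash escaping that the YAML profile requires.

```python
import ipaddress
import re
from typing import Optional

# enum: integer values of ifAdminStatus and their enumerated names.
IF_ADMIN_STATUS = {1: "up", 2: "down", 3: "testing"}


def hex_to_int(hex_value: str, byte_order: str) -> int:
    """hextoint: byte_order is "big" (BigEndian) or "little" (LittleEndian)."""
    return int.from_bytes(bytes.fromhex(hex_value), byte_order)


def hex_to_ip(hex_value: str) -> str:
    """hextoip: four bytes of hex become a dotted IPv4 string."""
    return str(ipaddress.IPv4Address(bytes.fromhex(hex_value)))


def hw_addr(hex_value: str) -> str:
    """hwaddr: hex bytes become a colon-separated MAC address string."""
    return ":".join(f"{byte:02x}" for byte in bytes.fromhex(hex_value))


def regexp_capture(pattern: str, raw: str) -> Optional[str]:
    """regexp: return the first capture group, or None if nothing matches."""
    match = re.search(pattern, raw)
    return match.group(1) if match else None


raw = " 5 Secs ( 96.3762%) 60 Secs ( 62.8549%) 300 Secs ( 25.2877%)"
print(IF_ADMIN_STATUS[1])
print(hex_to_int("0102", "big"), hex_to_int("0102", "little"))
print(hex_to_ip("c0a80001"))
print(hw_addr("001122aabbcc"))
print(regexp_capture(r"60 Secs.*?(\d+)", raw))
```

The two `hex_to_int` calls show why the byte order matters: the same two bytes decode to different integers depending on which byte is treated as most significant.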
</Collapser>
</CollapserGroup>
6 changes: 4 additions & 2 deletions src/nav/network-performance-monitoring.yml
Expand Up @@ -30,10 +30,10 @@ pages:
pages:
- title: Advanced config for network monitoring
path: /docs/network-performance-monitoring/advanced/advanced-config
- title: KTranslate Docker container management
path: /docs/network-performance-monitoring/advanced/ktranslate-container-management
- title: Creating and managing SNMP profiles
path: /docs/network-performance-monitoring/advanced/snmp-profiles
- title: KTranslate Docker container management
path: /docs/network-performance-monitoring/advanced/ktranslate-container-management
- title: KTranslate Docker health monitoring
path: /docs/network-performance-monitoring/advanced/ktranslate-container-health
- title: Troubleshooting network monitoring
Expand Up @@ -64,4 +64,6 @@ pages:
path: /docs/network-performance-monitoring/troubleshooting/snmp-discovery-kentik-default
- title: SNMP monitoring results have metrics missing
path: /docs/network-performance-monitoring/troubleshooting/snmp-polling-missing-metrics
- title: Understanding SNMP calculations
path: /docs/network-performance-monitoring/troubleshooting/understanding-snmp-utilization-calculations
