Add NetApp E-Series exporter #1790
Conversation
Signed-off-by: Trey Dockendorf <tdockendorf@osc.edu>
I've no direct experience of NetApps. Looking at the readme, does this apply to only this model, or to all NetApps? Is this more of a "SANtricity Web Services Proxy" exporter? Is there a way to get these metrics without that proxy? I'd like to make sure that if we list something it's appropriately named so users can find it. Some comments to help you improve your exporter:
- Prometheus passes the scrape timeout in as a header, so you might be able to use that rather than requiring the user to configure it.
- The `exporter.use-cache` flag should be removed. The correct action when data collection fails is to fail the entire scrape, not silently serve stale or partial data.
- Some metrics have inconsistent and non-base units, for example milliseconds, millibits, and percentages. You should use seconds, bytes, and ratios everywhere.
- Several metrics appear to be calculated gauges; if the raw counters are available from the API it'd be better to use them instead.
- `read_iops` is listed as a gauge, but sounds like a counter. If it is a counter it should end in `_total`.
- `drive_status` seems to be mixing different types of information in one metric. The status should not be both a label and the metric value, which will be difficult for users to use correctly. Based on the help string, I think the correct way to handle this is to remove the status label, as there are only 0 and 1 states. `storage_system_status` looks to have the same issue.
- The system-statistics appear to have an `id` label across all of them, which indicates it is actually a target label, which it is the responsibility of Prometheus's service discovery to deal with. If you wish to expose this, it should be on a single info metric. Similarly for `systemid`.
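For reference on the first point: Prometheus advertises its per-scrape timeout in the `X-Prometheus-Scrape-Timeout-Seconds` request header, so the exporter can bound collection with that instead of a separate flag. A minimal sketch, not the exporter's actual code; the handler, port, and fallback are hypothetical:

```go
package main

import (
	"net/http"
	"strconv"
	"time"
)

// scrapeTimeout returns the timeout Prometheus advertised for this scrape via
// the X-Prometheus-Scrape-Timeout-Seconds header, falling back to a default
// when the header is absent or malformed.
func scrapeTimeout(r *http.Request, fallback time.Duration) time.Duration {
	v := r.Header.Get("X-Prometheus-Scrape-Timeout-Seconds")
	if v == "" {
		return fallback
	}
	seconds, err := strconv.ParseFloat(v, 64)
	if err != nil || seconds <= 0 {
		return fallback
	}
	return time.Duration(seconds * float64(time.Second))
}

func main() {
	http.HandleFunc("/metrics", func(w http.ResponseWriter, r *http.Request) {
		// The collection against the SANtricity proxy would be bounded by a
		// context derived from r.Context() with this timeout.
		_ = scrapeTimeout(r, 30*time.Second)
	})
	_ = http.ListenAndServe(":9313", nil)
}
```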
The web service is just a proxy to the API exposed by the E-Series. The NetApp E-Series is a type of storage, so this works by querying the E-Series API via the SANtricity proxy. NetApp does not document how to talk to the E-Series API directly. This would not work on all NetApp systems; for example, we have NetApp ONTAP systems of various models, and those have entirely different methods for collecting data, and we actually deploy a different exporter for that entirely different system. As for models supported, that's a good question and I'm not sure on that one. I think the Proxy support matrix may determine which E-Series models work, and NetApp has lots of support matrix pages.
I have tended to use my own timeouts to kill off collection before the exporter gets marked down, so I can expose a metric that tells me about the timeout and alert when the code times out versus the exporter being down. This gives me insight into when exporters are timing out versus totally down.
I realize this breaks norms, but it was done for a very specific reason. If I have an alert that tells me when an E-Series has a failed drive, and the E-Series has a hiccup and returns no data, Prometheus and Alertmanager interpret no data as the alert being resolved and I get a resolved email. That's very bad when a team of people are responsible for hardware; if they see a resolved email they may not take action to address the thing that's still a problem. I have not found another way to avoid false resolves other than caching data that is more about state and not about performance metrics. Unless I was overly aggressive with caching, the caching should only be for things that are state, like whether a drive is healthy. Things like bytes read or IOPS will not be cached. This was more geared towards caching things that would likely be alerts.
The units I used were whatever comes out of the E-Series API. Also, browsing the code, everything as best I can remember just takes JSON data from the API, maybe adds labels for context around the value, and then makes that the Gauge. I am not finding where calculations take place, but I haven't looked at this code in a while, so maybe I missed it in my scan.
Based on the way this data looks in Grafana, it's a rate from the raw source, so it's just showing
The status label is important as it can be many things, and putting that status into an Alertmanager alert gives me and others on my team a quick way to know what has actually gone wrong. Would it be better to keep the status label and have the metric be more like
The ID value is coming from the E-Series API. I'd want to keep that ID on all things so I could do something like this: The value for ID is what's used to actually configure the Web Proxy to make API calls, so I did this, for example: That added my E-Series to the web proxy service so I could make API calls. It happens to be the value for target in Prometheus configs, so are you suggesting using Prometheus configs to add that label rather than exporting it from the exporter?
The APIs you're hitting don't have "e series" in their paths, so I suspect this may be more broadly applicable.
That indicates a fragile alert that needs fixing, not a reason to go against best practices and make an exporter much harder to use correctly generally. See https://www.robustperception.io/alerting-on-gauges-in-prometheus-2-0
That's a different problem, see https://www.robustperception.io/running-into-burning-buildings-because-the-fire-alarm-stopped
That's generally not very interesting, as an exporter being down is very rare - particularly if you've a supervisor. It's generally best to keep it simple. If scrapes are regularly failing, you have bigger problems.
That would be an acceptable tradeoff if you were treating all the API attributes in bulk. However, given that each metric already has a few lines of code dedicated to it, including specifying the exact metric name, you may as well get this right while you're at it.
It's not uncommon unfortunately that devices don't provide raw counters, if they don't then there's not much we can do about it here.
Info metrics shouldn't change their labels over time, as you want to avoid series appearing and disappearing. How many potential status values are there, and do you know them a priori?
Definitely a target label rather than an instrumentation label then; however, it also sounds like it's equivalent to the instance label in identity terms, so it should be an info metric at most. However, that's not what I'm seeing in the code; it never makes POST requests. If you want this on all metrics, the only way to do that correctly is in Prometheus, as otherwise you'll miss it on `up`. Thus this is a matter for service discovery and relabelling, not something an exporter should be deciding for users.
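For reference, the "single info metric" pattern mentioned earlier would look roughly like this in client_golang; the metric and label names are assumptions, not the exporter's real ones:

```go
package collector

import "github.com/prometheus/client_golang/prometheus"

// systemInfoDesc carries identifying labels on a single series fixed at 1,
// instead of stamping them onto every metric the exporter exposes.
var systemInfoDesc = prometheus.NewDesc(
	"netapp_eseries_storage_system_info",
	"Identifying information about the storage system as reported by the proxy.",
	[]string{"id", "name"}, nil,
)

// collectInfo emits the info metric for one storage system; id and name are
// whatever the proxy's API returns for that system.
func collectInfo(ch chan<- prometheus.Metric, id, name string) {
	ch <- prometheus.MustNewConstMetric(systemInfoDesc, prometheus.GaugeValue, 1, id, name)
}
```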
I am almost certain the API and the structure of the JSON returned are specific to E-Series. The proxy itself might be able to return other device data, and the URL paths exposed by the proxy don't include model-specific strings, but the JSON structure in the documentation is specific to E-Series, e.g. https://library.netapp.com/ecmdocs/ECMLP2860209/html/index.html, at the top:
I wish I had come across that when I initially ran into the issues with gauge values causing false resolves. I think the use of
The main reason I like error and timeout metrics is that they allow Prometheus to tell me when something is failing to collect data without me having to use a separate system like Splunk to scrape the logs. We have had times in the past with this exporter where the exporter was still running but the network connection to the E-Series was lost, so we got alerts about errors collecting data and timeouts rather than just the exporter being down. Had only the exporter-down alerts fired, we would have begun troubleshooting the wrong thing.
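For illustration only, the kind of error and timeout metrics being described here would look roughly like the sketch below; the names are hypothetical, and whether to expose them at all is the question being discussed:

```go
package collector

import "github.com/prometheus/client_golang/prometheus"

var (
	// collectErrors counts collections that failed against the E-Series API,
	// so "exporter up but backend unreachable" is visible without log scraping.
	collectErrors = prometheus.NewCounter(prometheus.CounterOpts{
		Name: "netapp_eseries_collect_errors_total",
		Help: "Total number of failed collections against the E-Series API.",
	})
	// collectTimeouts counts collections that exceeded their deadline.
	collectTimeouts = prometheus.NewCounter(prometheus.CounterOpts{
		Name: "netapp_eseries_collect_timeouts_total",
		Help: "Total number of collections that hit the scrape deadline.",
	})
)

func init() {
	prometheus.MustRegister(collectErrors, collectTimeouts)
}
```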
I will work to get the units of the various metrics standardized.
I believe the statuses are known ahead of time, as the API documentation for a Drive, for example, documents all the possible statuses: https://library.netapp.com/ecmdocs/ECMLP2860209/html/index.html#_drive Given the statuses are known, what's the right pattern to take? Have something like
I'll move the ID label to Prometheus config relabeling and just document in the README how to add it to Prometheus configs.
It's pretty obvious which is the case from the Prometheus targets page, but in general alerts should be symptom-based rather than telling you exact causes, as that's ultimately more maintainable and covers more failure modes. It should also be apparent what the issue is when you manually visit /metrics. As I said, it's very rare that the exporter itself will be broken, so only a quick check that it isn't is required before continuing on with debugging the target.
There's 10 states, so that should be okay. I saw the pattern used correctly in one of your exporters.
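The pattern being referred to is commonly implemented as one 0/1 series per documented state, so the status is a label but never the sample value itself. A minimal sketch, assuming client_golang; the metric, label, and status names are hypothetical, and the real status list comes from the Drive documentation linked above:

```go
package collector

import "github.com/prometheus/client_golang/prometheus"

// driveStatuses stands in for the ten states the API documents for a drive;
// the real list should be taken from the E-Series documentation.
var driveStatuses = []string{"optimal", "failed", "bypassed"}

var driveStatusDesc = prometheus.NewDesc(
	"netapp_eseries_drive_status",
	"Status of the drive, 1 if the drive is in the given state, 0 otherwise.",
	[]string{"drive", "status"}, nil,
)

// collectDriveStatus emits one 0/1 series per known status, with a 1 only for
// the status the API reported for this drive.
func collectDriveStatus(ch chan<- prometheus.Metric, drive, current string) {
	for _, s := range driveStatuses {
		value := 0.0
		if s == current {
			value = 1.0
		}
		ch <- prometheus.MustNewConstMetric(driveStatusDesc, prometheus.GaugeValue, value, drive, s)
	}
}
```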
@brian-brazil I think I've made all the necessary changes. I removed all caching, removed ID-related labels, and standardized units for all metrics. Some of the things like
The point is more that percentages are not base units, ratios are, so they should be adjusted accordingly. I note that there seem to be both `writeBytesTotal` and `fullStripeWritesBytes` counters in the docs, so those should be exposed instead of `FullStripeWritesBytesPercent`, as raw data is preferable. Nor is a ratio of counters since the machine started particularly useful.
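For illustration, exposing the raw counters directly might look like the sketch below; the JSON field names come from the comment above, while the metric and label names are assumptions:

```go
package collector

import "github.com/prometheus/client_golang/prometheus"

// analysedStats models the two raw counters mentioned above; the exact JSON
// fields and the "volume" label are assumptions, not verified against the API.
type analysedStats struct {
	WriteBytesTotal       float64 `json:"writeBytesTotal"`
	FullStripeWritesBytes float64 `json:"fullStripeWritesBytes"`
}

var (
	writeBytesDesc = prometheus.NewDesc(
		"netapp_eseries_write_bytes_total",
		"Total bytes written.", []string{"volume"}, nil)
	fullStripeWriteBytesDesc = prometheus.NewDesc(
		"netapp_eseries_full_stripe_write_bytes_total",
		"Total bytes written as full-stripe writes.", []string{"volume"}, nil)
)

// collectWriteCounters exposes the raw counters and leaves any ratio to be
// computed from rates in PromQL.
func collectWriteCounters(ch chan<- prometheus.Metric, volume string, s analysedStats) {
	ch <- prometheus.MustNewConstMetric(writeBytesDesc, prometheus.CounterValue, s.WriteBytesTotal, volume)
	ch <- prometheus.MustNewConstMetric(fullStripeWriteBytesDesc, prometheus.CounterValue, s.FullStripeWritesBytes, volume)
}
```

The full-stripe fraction over a window can then be derived in PromQL, e.g. `rate(netapp_eseries_full_stripe_write_bytes_total[5m]) / rate(netapp_eseries_write_bytes_total[5m])`.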
It looks like I need to refactor where I pull metrics from, which I'm working on right now. The API endpoints I'm currently using are analyzed metrics rather than raw counters. So right now I will hit
Yeah, many of them seemed to be lacking raw counters unfortunately, but it's best to use raw where you can.
I think I've made the necessary adjustments. One of the really useful-looking API endpoints of
I just see some small things that could be improved. In system-statistics, `cpu_max_utilization` would be a better name, as it goes from least to most specific and also matches `cpu_avg_utilization`. The unit of ratio should also be mentioned (and divide if they're being reported as percentages). Similarly over in controller-statistics, and there's also an inconsistency between `avg` and `average`. `max_possible_throughput` is lacking a unit. Units go at the end, so `raidX_transferred_bytes_total`.
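A sketch of those naming conventions, using a hypothetical `netapp_eseries_` prefix; the exact names, and the assumption that the source values are 0-100 percentages, are not taken from the exporter's actual code:

```go
package collector

import "github.com/prometheus/client_golang/prometheus"

var (
	// Least-to-most specific naming, matching cpu_avg_utilization, with the
	// value exposed as a 0.0-1.0 ratio rather than a percentage.
	cpuAvgUtilization = prometheus.NewDesc(
		"netapp_eseries_cpu_avg_utilization",
		"Average CPU utilization as a ratio (0.0-1.0).", nil, nil)
	cpuMaxUtilization = prometheus.NewDesc(
		"netapp_eseries_cpu_max_utilization",
		"Maximum CPU utilization as a ratio (0.0-1.0).", nil, nil)
	// Unit at the end of the name for the per-RAID-level byte counters.
	raid5TransferredBytes = prometheus.NewDesc(
		"netapp_eseries_raid5_transferred_bytes_total",
		"Total bytes transferred for RAID 5 operations.", nil, nil)
)

// collectCPU divides the API's 0-100 percentages down to ratios before
// exposing them.
func collectCPU(ch chan<- prometheus.Metric, avgPercent, maxPercent float64) {
	ch <- prometheus.MustNewConstMetric(cpuAvgUtilization, prometheus.GaugeValue, avgPercent/100)
	ch <- prometheus.MustNewConstMetric(cpuMaxUtilization, prometheus.GaugeValue, maxPercent/100)
}
```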
I've made the requested adjustments. The documentation indirectly makes me believe CPU utilizations are reported as percents between 1-100, so I just divided those values by 100 and mentioned in HELP that ratios are 0.0-1.0.
Giving it a quick once-over: for the various time counters in drive-statistics, the unit should be in the name. From the docs it looks to be microseconds, so they will need conversion. The utilization metrics in system- and controller-statistics should have a `_ratio` unit.
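For example, a cumulative time counter reported in microseconds could be exposed with the unit in the name and converted to seconds at scrape time. A sketch with hypothetical field and metric names, assuming the API value really is microseconds:

```go
package collector

import "github.com/prometheus/client_golang/prometheus"

// readTimeDesc puts the base unit in the name; the raw counter is assumed to
// be reported by the API in microseconds and converted at scrape time.
var readTimeDesc = prometheus.NewDesc(
	"netapp_eseries_drive_read_time_seconds_total",
	"Total time spent servicing reads, in seconds.",
	[]string{"drive"}, nil,
)

// collectReadTime converts the cumulative microsecond counter to seconds.
func collectReadTime(ch chan<- prometheus.Metric, drive string, readTimeMicroseconds float64) {
	ch <- prometheus.MustNewConstMetric(readTimeDesc, prometheus.CounterValue,
		readTimeMicroseconds/1e6, drive)
}
```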
I made the necessary adjustments. I am not 100% sure the time metric counters are microseconds; the docs that describe some of them as microseconds are for `InterfaceStatistics` and not the `DiskIOStats` type, but it seems reasonable that if one set of raw counters uses microseconds, the others would too. Once I deploy this code I should be able to compare Prometheus rates of those counters with the analyzed stats from the API to see if things line up.
You forgot to add seconds for the 4 `_time_total` metrics in drive-statistics.
I fixed those time metrics to include the seconds unit in the name.
Thanks!
Adds link for NetApp E-Series exporter.