Add NetApp E-Series exporter #1790
Conversation
Signed-off-by: Trey Dockendorf <tdockendorf@osc.edu>
I've no direct experience of NetApps. Looking at the readme, does this apply to only this model, or to all NetApps? Is this more of a "SANtricity Web Services Proxy" exporter? Is there a way to get these metrics without that proxy? I'd like to make sure that if we list something it's appropriately named so users can find it. Some comments to help you improve your exporter:
- Prometheus passes the scrape timeout in as a header, so you might be able to use that rather than requiring the user to configure it.
- The `exporter.use-cache` flag should be removed. The correct action when data collection fails is to fail the entire scrape, not silently serve stale or partial data.
- Some metrics have inconsistent and non-base units, for example milliseconds, millibits, and percentages. You should use seconds, bytes, and ratios everywhere.
- Several metrics appear to be calculated gauges; if the raw counters are available from the API it'd be better to use them instead.
- `read_iops` is listed as a gauge, but sounds like a counter. If it is a counter it should end in `_total`.
- `drive_status` seems to be mixing different types of information in one metric. The status should not be both a label and the metric value, which will be difficult for users to use correctly. Based on the help string, I think the correct way to handle this is to remove the status label, as there are only 0 and 1 states. `storage_system_status` looks to have the same issue.
- The system-statistics appear to have an `id` label across all of them, which indicates it is actually a target label, which it is the responsibility of Prometheus's service discovery to deal with. If you wish to expose this, it should be on a single info metric. Similarly for `systemid`.
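For reference on the first point: Prometheus advertises its per-scrape timeout in the `X-Prometheus-Scrape-Timeout-Seconds` request header, so the exporter can bound collection with that instead of a separate flag. A minimal sketch, not the exporter's actual code; the handler, port, and fallback are hypothetical:

```go
package main

import (
	"net/http"
	"strconv"
	"time"
)

// scrapeTimeout returns the timeout Prometheus advertised for this scrape via
// the X-Prometheus-Scrape-Timeout-Seconds header, falling back to a default
// when the header is absent or malformed.
func scrapeTimeout(r *http.Request, fallback time.Duration) time.Duration {
	v := r.Header.Get("X-Prometheus-Scrape-Timeout-Seconds")
	if v == "" {
		return fallback
	}
	seconds, err := strconv.ParseFloat(v, 64)
	if err != nil || seconds <= 0 {
		return fallback
	}
	return time.Duration(seconds * float64(time.Second))
}

func main() {
	http.HandleFunc("/metrics", func(w http.ResponseWriter, r *http.Request) {
		// The collection against the SANtricity proxy would be bounded by a
		// context derived from r.Context() with this timeout.
		_ = scrapeTimeout(r, 30*time.Second)
	})
	_ = http.ListenAndServe(":9313", nil)
}
```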
The web service is just a proxy to the API exposed by the E-Series. The NetApp E-Series is a type of storage, so this works by querying the E-Series API via the SANtricity proxy. NetApp does not document how to talk to the E-Series API directly. This would not work on all NetApp systems; for example, we have NetApp ONTAP systems of various models, and those have entirely different methods for collecting data, and we actually deploy a different exporter for that entirely different system. As for models supported, that's a good question and I'm not sure on that one. I think the Proxy support matrix may determine which E-Series models work, and NetApp has lots of support matrix pages.
I have tended to use my own timeouts to kill off collection before the exporter gets marked down, so I can expose a metric that tells me about the timeout and alert when the code times out versus the exporter being down. This gives me insight into when exporters are timing out versus totally down.
I realize this breaks norms, but it was done for a very specific reason. If I have an alert that tells me when an E-Series has a failed drive, and the E-Series has a hiccup and returns no data, Prometheus and Alertmanager interpret no data as the alert being resolved and I get a resolved email. That's very bad when a team of people are responsible for hardware; if they see a resolved email they may not take action to address the thing that's still a problem. I have not found another way to avoid false resolves other than caching data that is more about state and not about performance metrics. Unless I was overly aggressive with caching, the caching should only be for things that are state, like whether a drive is healthy. Things like bytes read or IOPS will not be cached. This was more geared towards caching things that would likely be alerts.
The units I used were whatever comes out of the E-Series API. Also, browsing the code, everything as best I can remember just takes JSON data from the API, maybe adds labels for context around the value, and then makes that the Gauge. I am not finding where calculations take place, but I haven't looked at this code in a while, so maybe I missed it in my scan.
Based on the way this data looks in Grafana, it's a rate from the raw source, so it's just showing
The status label is important as it can be many things, and putting that status into an Alertmanager alert gives me and others on my team a quick way to know what has actually gone wrong. Would it be better to keep the status label and have the metric be more like
The ID value is coming from the E-Series API. I'd want to keep that ID on all things so I could do something like this: The value for ID is what's used to actually configure the Web Proxy to make API calls, so I did this, for example: That added my E-Series to the web proxy service so I could make API calls. It happens to be the value for target in Prometheus configs, so are you suggesting using Prometheus configs to add that label rather than exporting it from the exporter?
The APIs you're hitting don't have "e series" in their paths, so I suspect this may be more broadly applicable.
That indicates a fragile alert that needs fixing, not a reason to go against best practices and make an exporter much harder to use correctly generally. See https://www.robustperception.io/alerting-on-gauges-in-prometheus-2-0
That's a different problem, see https://www.robustperception.io/running-into-burning-buildings-because-the-fire-alarm-stopped
That's generally not very interesting, as an exporter being down is very rare - particularly if you've a supervisor. It's generally best to keep it simple. If scrapes are regularly failing, you have bigger problems.
That would be an acceptable tradeoff if you were treating all the API attributes in bulk. However, given that each metric already has a few lines of code dedicated to it, including specifying the exact metric name, you may as well get this right while you're at it.
It's not uncommon unfortunately that devices don't provide raw counters, if they don't then there's not much we can do about it here.
Info metrics shouldn't change their labels over time, as you want to avoid series appearing and disappearing. How many potential status values are there, and do you know them a priori?
Definitely a target label rather than an instrumentation label then; however, it also sounds like it's equivalent to the instance label in identity terms, so it should be an info metric at most. However, that's not what I'm seeing in the code; it never makes POST requests. If you want this on all metrics, the only way to do that correctly is in Prometheus, as otherwise you'll miss it on `up`. Thus this is a matter for service discovery and relabelling, not something an exporter should be deciding for users.
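For reference, the "single info metric" pattern mentioned earlier would look roughly like this in client_golang; the metric and label names are assumptions, not the exporter's real ones:

```go
package collector

import "github.com/prometheus/client_golang/prometheus"

// systemInfoDesc carries identifying labels on a single series fixed at 1,
// instead of stamping them onto every metric the exporter exposes.
var systemInfoDesc = prometheus.NewDesc(
	"netapp_eseries_storage_system_info",
	"Identifying information about the storage system as reported by the proxy.",
	[]string{"id", "name"}, nil,
)

// collectInfo emits the info metric for one storage system; id and name are
// whatever the proxy's API returns for that system.
func collectInfo(ch chan<- prometheus.Metric, id, name string) {
	ch <- prometheus.MustNewConstMetric(systemInfoDesc, prometheus.GaugeValue, 1, id, name)
}
```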
I am almost certain the API and the structure of the JSON returned are specific to E-Series. The proxy itself might be able to return other device data, and the URL paths exposed by the proxy don't include model-specific strings, but the JSON structure in the documentation is specific to E-Series, e.g. https://library.netapp.com/ecmdocs/ECMLP2860209/html/index.html, at the top:
I wish I had come across that when I initially ran into the issues with gauge values causing false resolves. I think the use of
The main reason I like error and timeout metrics is that they allow Prometheus to tell me when something is failing to collect data without me having to use a separate system like Splunk to scrape the logs. We have had times in the past with this exporter where the exporter was still running but the network connection to the E-Series was lost, so we got alerts about errors collecting data and timeouts rather than just the exporter being down. Had only the exporter-down alerts fired, we would have begun troubleshooting the wrong thing.
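For illustration only, the kind of error and timeout metrics being described here would look roughly like the sketch below; the names are hypothetical, and whether to expose them at all is the question being discussed:

```go
package collector

import "github.com/prometheus/client_golang/prometheus"

var (
	// collectErrors counts collections that failed against the E-Series API,
	// so "exporter up but backend unreachable" is visible without log scraping.
	collectErrors = prometheus.NewCounter(prometheus.CounterOpts{
		Name: "netapp_eseries_collect_errors_total",
		Help: "Total number of failed collections against the E-Series API.",
	})
	// collectTimeouts counts collections that exceeded their deadline.
	collectTimeouts = prometheus.NewCounter(prometheus.CounterOpts{
		Name: "netapp_eseries_collect_timeouts_total",
		Help: "Total number of collections that hit the scrape deadline.",
	})
)

func init() {
	prometheus.MustRegister(collectErrors, collectTimeouts)
}
```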
I will work to get the units of the various metrics standardized.
I believe the statuses are known ahead of time, as the API documentation for a Drive, for example, documents all the possible statuses: https://library.netapp.com/ecmdocs/ECMLP2860209/html/index.html#_drive Given the statuses are known, what's the right pattern to take? Have something like
I'll move the ID label to Prometheus config relabeling and just document in the README how to add it to Prometheus configs.
It's pretty obvious which is the case from the Prometheus targets page, but in general alerts should be symptom-based rather than telling you exact causes, as that's ultimately more maintainable and covers more failure modes. It should also be apparent what the issue is when you manually visit /metrics. As I said, it's very rare that the exporter itself will be broken, so only a quick check that it isn't is required before continuing on with debugging the target.
There's 10 states, so that should be okay. I saw the pattern used correctly in one of your exporters.
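The pattern being referred to is commonly implemented as one 0/1 series per documented state, so the status is a label but never the sample value itself. A minimal sketch, assuming client_golang; the metric, label, and status names are hypothetical, and the real status list comes from the Drive documentation linked above:

```go
package collector

import "github.com/prometheus/client_golang/prometheus"

// driveStatuses stands in for the ten states the API documents for a drive;
// the real list should be taken from the E-Series documentation.
var driveStatuses = []string{"optimal", "failed", "bypassed"}

var driveStatusDesc = prometheus.NewDesc(
	"netapp_eseries_drive_status",
	"Status of the drive, 1 if the drive is in the given state, 0 otherwise.",
	[]string{"drive", "status"}, nil,
)

// collectDriveStatus emits one 0/1 series per known status, with a 1 only for
// the status the API reported for this drive.
func collectDriveStatus(ch chan<- prometheus.Metric, drive, current string) {
	for _, s := range driveStatuses {
		value := 0.0
		if s == current {
			value = 1.0
		}
		ch <- prometheus.MustNewConstMetric(driveStatusDesc, prometheus.GaugeValue, value, drive, s)
	}
}
```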
@brian-brazil I think I've made all the necessary changes. I removed all caching, removed ID-related labels, and standardized units for all metrics. Some of the things like
The point is more that percentages are not base units, ratios are, so they should be adjusted accordingly. I note that there seem to be both `writeBytesTotal` and `fullStripeWritesBytes` counters in the docs, so those should be exposed instead of `FullStripeWritesBytesPercent`, as raw data is preferable. Nor is a ratio of counters since the machine started particularly useful.
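For illustration, exposing the raw counters directly might look like the sketch below; the JSON field names come from the comment above, while the metric and label names are assumptions:

```go
package collector

import "github.com/prometheus/client_golang/prometheus"

// analysedStats models the two raw counters mentioned above; the exact JSON
// fields and the "volume" label are assumptions, not verified against the API.
type analysedStats struct {
	WriteBytesTotal       float64 `json:"writeBytesTotal"`
	FullStripeWritesBytes float64 `json:"fullStripeWritesBytes"`
}

var (
	writeBytesDesc = prometheus.NewDesc(
		"netapp_eseries_write_bytes_total",
		"Total bytes written.", []string{"volume"}, nil)
	fullStripeWriteBytesDesc = prometheus.NewDesc(
		"netapp_eseries_full_stripe_write_bytes_total",
		"Total bytes written as full-stripe writes.", []string{"volume"}, nil)
)

// collectWriteCounters exposes the raw counters and leaves any ratio to be
// computed from rates in PromQL.
func collectWriteCounters(ch chan<- prometheus.Metric, volume string, s analysedStats) {
	ch <- prometheus.MustNewConstMetric(writeBytesDesc, prometheus.CounterValue, s.WriteBytesTotal, volume)
	ch <- prometheus.MustNewConstMetric(fullStripeWriteBytesDesc, prometheus.CounterValue, s.FullStripeWritesBytes, volume)
}
```

The full-stripe fraction over a window can then be derived in PromQL, e.g. `rate(netapp_eseries_full_stripe_write_bytes_total[5m]) / rate(netapp_eseries_write_bytes_total[5m])`.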
It looks like I need to refactor where I pull metrics from, which I'm working on right now. The API endpoints I'm currently using are analyzed metrics rather than raw counters. So right now I will hit
Yeah, many of them seemed to be lacking raw counters unfortunately, but it's best to use raw where you can.
I think I've made the necessary adjustments. One of the really useful-looking API endpoints of
I just see some small things that could be improved. In system-statistics, `cpu_max_utilization` would be a better name, as it goes from least to most specific and also matches `cpu_avg_utilization`. The unit of ratio should also be mentioned (and divide if they're being reported as percentages). Similarly over in controller-statistics, and there's also an inconsistency between `avg` and `average`. `max_possible_throughput` is lacking a unit. Units go at the end, so `raidX_transferred_bytes_total`.
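A sketch of those naming conventions, using a hypothetical `netapp_eseries_` prefix; the exact names, and the assumption that the source values are 0-100 percentages, are not taken from the exporter's actual code:

```go
package collector

import "github.com/prometheus/client_golang/prometheus"

var (
	// Least-to-most specific naming, matching cpu_avg_utilization, with the
	// value exposed as a 0.0-1.0 ratio rather than a percentage.
	cpuAvgUtilization = prometheus.NewDesc(
		"netapp_eseries_cpu_avg_utilization",
		"Average CPU utilization as a ratio (0.0-1.0).", nil, nil)
	cpuMaxUtilization = prometheus.NewDesc(
		"netapp_eseries_cpu_max_utilization",
		"Maximum CPU utilization as a ratio (0.0-1.0).", nil, nil)
	// Unit at the end of the name for the per-RAID-level byte counters.
	raid5TransferredBytes = prometheus.NewDesc(
		"netapp_eseries_raid5_transferred_bytes_total",
		"Total bytes transferred for RAID 5 operations.", nil, nil)
)

// collectCPU divides the API's 0-100 percentages down to ratios before
// exposing them.
func collectCPU(ch chan<- prometheus.Metric, avgPercent, maxPercent float64) {
	ch <- prometheus.MustNewConstMetric(cpuAvgUtilization, prometheus.GaugeValue, avgPercent/100)
	ch <- prometheus.MustNewConstMetric(cpuMaxUtilization, prometheus.GaugeValue, maxPercent/100)
}
```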
I've made the requested adjustments. The documentation indirectly makes me believe CPU utilizations are reported as percents between 1-100, so I just divided those values by 100 and mentioned in HELP that ratios are 0.0-1.0.
Giving it a quick once-over: for the various time counters in drive-statistics, the unit should be in the name. From the docs it looks to be microseconds, so they will need conversion. The utilization metrics in system- and controller-statistics should have a `_ratio` unit.
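For example, a cumulative time counter reported in microseconds could be exposed with the unit in the name and converted to seconds at scrape time. A sketch with hypothetical field and metric names, assuming the API value really is microseconds:

```go
package collector

import "github.com/prometheus/client_golang/prometheus"

// readTimeDesc puts the base unit in the name; the raw counter is assumed to
// be reported by the API in microseconds and converted at scrape time.
var readTimeDesc = prometheus.NewDesc(
	"netapp_eseries_drive_read_time_seconds_total",
	"Total time spent servicing reads, in seconds.",
	[]string{"drive"}, nil,
)

// collectReadTime converts the cumulative microsecond counter to seconds.
func collectReadTime(ch chan<- prometheus.Metric, drive string, readTimeMicroseconds float64) {
	ch <- prometheus.MustNewConstMetric(readTimeDesc, prometheus.CounterValue,
		readTimeMicroseconds/1e6, drive)
}
```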
I made the necessary adjustments. I am not 100% sure the time metric counters are microseconds; the docs that describe some of them as microseconds are for `InterfaceStatistics` and not the `DiskIOStats` type, but it seems reasonable that if one set of raw counters uses microseconds, the others would too. Once I deploy this code I should be able to compare Prometheus rates of those counters with the analyzed stats from the API to see if things line up.
You forgot to add seconds for the 4 `_time_total` metrics in drive-statistics.
I fixed those time metrics to include the seconds unit in the name.
Thanks!
Adds link for NetApp E-Series exporter.