-
Set allocated_draw to 500W on both power ports. Connect PSU A to PDU A on the "primary" power feed and PSU B to PDU B on the "redundant" power feed. Then don't count the "redundant" power feed towards the data centre contracts.
✅ It does give the peak utilization when either PDU is out of service
❌ AFAICS, netbox doesn't take into account "primary" or "redundant" when working out total rack power utilization; power will be double-counted
❌ It isn't exactly accurate; in reality, the device will normally draw 250W from each PSU
❌ You may have some single-PSU devices in the rack, and connect some to PDU A and others to PDU B. This would then fail to count the usage of the "B" devices towards the total
While there is a list of possible enhancements in your detailed document, I think there is a missing column in the data for storing an actual utilization number (scraped via an external process over SNMP, API, etc.), and a need to mark and elide redundant data when aggregating at a higher level. I think you are right that the way this should work is: the max should be the rating for the power supply (1100W or whatever); the allocation should be how much power needs to be reserved to run the device off this sole power supply (500W); and an actual value (maybe just a custom_field for those who want it) should show measured utilization (250W). I am reminded of PoE classes, where the edge device might draw 5W but needs to be allocated 15W based on its class, and the switch needs to reserve enough for the total draw possible or it'll refuse to enable more ports. In this case it's about modeling allocation for the PDUs so you don't exceed 50% of their capacity and they can provide useful redundancy during a circuit outage.
So: add a flag for redundant connections so they don't get double-counted at the power-panel level, and add a custom_field for caching actual current data, to cross-check your allocations. Then you can see actual, allocated and maximum possible utilization at the power panel level.
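A minimal sketch of how that aggregation could work, in Python. Everything here is hypothetical (the `redundant` flag and cached `actual` value don't exist in NetBox today); it just illustrates which legs would count towards each total:

```python
from dataclasses import dataclass

@dataclass
class PortPower:
    maximum: int     # PSU rating, e.g. 1100 W
    allocated: int   # reserved to run the device off this PSU alone, e.g. 500 W
    actual: int      # measured draw cached from SNMP/API, e.g. 250 W (custom field)
    redundant: bool  # hypothetical flag: True for the "B" leg of a redundant pair

def panel_totals(ports):
    """Measured draw really flows on every leg, but allocated/maximum power
    is only counted once per device, on the non-redundant legs."""
    primary = [p for p in ports if not p.redundant]
    return {
        "actual": sum(p.actual for p in ports),
        "allocated": sum(p.allocated for p in primary),
        "maximum": sum(p.maximum for p in primary),
    }

# One dual-PSU device: the A leg is primary, the B leg flagged redundant.
ports = [PortPower(1100, 500, 250, False), PortPower(1100, 500, 250, True)]
print(panel_totals(ports))  # {'actual': 500, 'allocated': 500, 'maximum': 1100}
```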
— Mark Tinberg, Division of Information Technology-Network Services, University of Wisconsin-Madison
-
> The trouble with automatically ingesting measured power draw is:
Sure, this probably doesn't belong exactly in core, but having a plugin or custom_field to scrape this data so you can report on discrepancies between the model and reality (kind of like how the napalm/lldp audit was moved to a plugin) seems useful. I think OpenDCIM does something like that, but you are right, it probably doesn't belong in core.
> I don't think that's what I was suggesting; the power supply rating is not something I had considered. I guess it could be of interest, but IMO it probably belongs in the module type or inventory item type, since it's not something you can adjust - it's just part of the device specification. I don't think it is very meaningful to sum it across all devices. It's very much a worst-case figure: whether the PSU itself could cope if you stuffed a single server with the maximum number of GPU cards or hard drives or whatever.
I was thinking (not clearly, as I haven't tackled this problem seriously yet; it's on the backlog somewhere) of needing to know the power supply capacity more for tracking PoE switch utilization, as that's the more pressing issue for my site: what's the max draw/heat load in a room, and what would happen if someone filled the access stack with a bunch of PoE devices (cameras, WAPs, card access, phones, etc.)? Would we need to swap out power supplies (750W for 1100W) on how many stack members, and would we need facilities to pull new electric circuits or cooling to a room?
> However, Netbox's current summing of these will show the device using 500+500=1000W, when it's only using 500W, so as you say, you'd need to exclude the double-counted values somehow. I don't think marking a particular power feed as "redundant" is ideal, because you may have some devices that are not dual-powered. For example, I sometimes have a redundant pair of routers, each with a single PSU, and connect one to the A feed and one to the B.
I was thinking of marking the power supply/power port as redundant or not, rather than the power outlet or feed. It looks like the current model has redundancy as part of the feed, which is great except for single-homed devices whose only power comes from the "redundant" feed, as you pointed out. Redundancy happens at the port->outlet level a lot; I suppose it happens at the feed level if you use an ATS in front of the PDU. That's it: get an ATS and another PDU, and then you can put your single-homed devices behind a dual-homed PDU, behind a primary/redundant PDU, behind the power feed, and then you just need to solve the recursive lookup problem (kidding).
— Mark Tinberg, Division of Information Technology-Network Services, University of Wisconsin-Madison
________________________________
From: Brian Candler
Sent: Thursday, June 8, 2023 4:11 PM
Subject: Re: [netbox-community/netbox] Documenting the "power utilization" features of Netbox (Discussion #12837)
> The trouble with automatically ingesting measured power draw is:
> 1. It can vary widely from second to second, e.g. depending on CPU active/idle
> 2. It goes against the Netbox philosophy of "source of truth, not a monitoring system"
> 3. At the device level, it only works if you have intelligent PDUs. (In most places, I only get total power from the PDU, not per socket)
>
> If you could get per-port figures, then it would be reasonable to monitor the average draw over a week or month, and use that to guide what the "allocated draw" is configured as.
> In reality, I think the measured draw has much more concrete value anyway, as that's what the data centre will charge on - and it's likely to be an aggregate (per PDU / power feed), rather than at the level of individual devices.
> The value of the per-device power draw modelling is that it can predict draw when planning new capacity. If you can create a set of per-device-type allocations which sum to roughly the right amount (i.e. match the measured value for existing kit), then you can get an idea how many new devices you could add before you bust your power budget.
> Manufacturer specs are likely to err on the high side, so I think you're right that to be useful, these power draw figures should be representative of real devices in real service.
>
> > I think you are right in that the way this should work is the max should be the rating for the power supply (1100W or whatever)
>
> I don't think that's what I was suggesting; the power supply rating is not something I had considered. I guess it could be of interest, but IMO it probably belongs in the module type or inventory item type, since it's not something you can adjust - it's just part of the device specification. I don't think it is very meaningful to sum it across all devices. It's very much a worst-case figure: whether the PSU itself could cope if you stuffed a single server with the maximum number of GPU cards or hard drives or whatever.
>
> > the allocation should be how much power needs to be reserved to run the device off this sole power supply (500W)
>
> However, Netbox's current summing of these will show the device using 500+500=1000W, when it's only using 500W, so as you say, you'd need to exclude the double-counted values somehow. I don't think marking a particular power feed as "redundant" is ideal, because you may have some devices that are not dual-powered. For example, I sometimes have a redundant pair of routers, each with a single PSU, and connect one to the A feed and one to the B.
> I am inclined to say that allocated/expected power draw should be an attribute at the Device level, not at the Power Port level, because in general devices don't have active/standby power supplies, but instead load balance between them. But to model that today in Netbox you need to divide the draw between the power ports.
> If you take the view that "allocated draw" is the share of power that the port draws in normal operation, then I think "maximum power draw" should be the worst-case amount in a failure scenario. If the server has 2 power supplies but can run off 1, then it should be the full draw. If the server has 3 power supplies and can run off 2 (but not 1), then it should be half the full draw.
>
> > In this case its about modeling allocation for the PDUs so you don't exceed 50% of their capacity and they can provide useful redundancy of a circuit outage.
>
> Exactly: that's what I think the "maximum draw" should be aiming to model. "If we lost a PDU, would the second PDU have enough capacity to cope?"
-
Our goal when modeling power in DCIM is not to see how much power we are using, but to see how close we are to 50% utilization of each PDU fuse. How this looks in practice for us: for example, we have 230V PDUs A and B in a rack. Each PDU has two groups, each with its own fuse (16A).
The current system kinda works, but it's a bit hard to figure out and clumsy to set up. I would also like to have power utilization set at the device level, and then to distribute the load evenly across all present power supplies by default. For devices that don't have an even distribution of load, we could have a factor setting for each PSU that controls how much of the power should be distributed to that PSU.

My biggest issue with the current system is that there are devices whose power consumption is highly dependent on the populated bays/modules. For example, a blade chassis that can hold 16 server blades will have different power usage with 1 blade and with all 16 blades. This means that I have to change all of the PSUs every time we add/remove a blade. I think it would be great to have power consumption for each bay/module (alongside the parent device), aggregate all of that at the device level, and then distribute the aggregate value across available power supplies, as sketched below.
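A small sketch of that idea, with hypothetical inputs (none of these fields exist in NetBox today): the chassis and each populated bay/module contribute a draw, the total is aggregated at the device level, and a per-PSU factor controls the split:

```python
def distribute_load(chassis_base, module_draws, psu_factors):
    """chassis_base: draw of the bare chassis (W); module_draws: one entry per
    populated bay/module (W); psu_factors: relative weight per PSU
    (an even split is simply [1, 1, ...])."""
    total = chassis_base + sum(module_draws)
    weight_sum = sum(psu_factors)
    return [total * f / weight_sum for f in psu_factors]

# 16-slot blade chassis with 3 blades fitted and 4 PSUs sharing evenly:
print(distribute_load(400, [300, 300, 300], [1, 1, 1, 1]))  # [325.0, 325.0, 325.0, 325.0]
```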
-
FYI, I've uploaded a custom script here which may be useful when considering power draw at a site level (as opposed to the current per-PDU level). It calculates the total power draw across all devices in a site, just by summing the allocated power from each device's power ports in that site. It doesn't make any use of power feeds, and doesn't use the parent PDU power calculation; in fact, you don't even have to connect any cables to the power ports. It can also count the total power outlets in a site, and how many of those are available (not connected). A sketch of the idea follows.
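The script itself is linked above; as a rough illustration of the approach, a NetBox custom script along these lines (a sketch written against the extras.scripts API, not the uploaded code) could look like:

```python
from django.db.models import Sum
from extras.scripts import Script, ObjectVar
from dcim.models import PowerOutlet, PowerPort, Site

class SitePowerReport(Script):
    class Meta:
        name = "Site power summary"
        description = "Sum allocated draw and count power outlets for a site"

    site = ObjectVar(model=Site)

    def run(self, data, commit):
        site = data["site"]
        # Sum allocated_draw across every power port on devices in this site;
        # no power feeds or cables are consulted at all.
        allocated = PowerPort.objects.filter(device__site=site).aggregate(
            total=Sum("allocated_draw")
        )["total"] or 0
        outlets = PowerOutlet.objects.filter(device__site=site)
        free = outlets.filter(cable__isnull=True).count()
        self.log_info(f"Allocated draw: {allocated} W")
        self.log_info(f"Power outlets: {outlets.count()} total, {free} free")
        return f"{allocated} W allocated across site {site}"
```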
-
I don't know if this is the right place for my question. As mentioned in the original post, in our use case we have four power feeds coming into our rack: two primary and two redundant ones. Am I doing something wrong, or is this how it's intended to be?
-
There seems to be very little documentation about the power utilization features and calculation in Netbox. Here it says:
And indeed, I see a power utilization figure at rack level. But it doesn't say how it relates to "Maximum draw" and "Allocated draw" on a power port; nor how the "Max utilization (percentage)" on a power feed relates to either "Maximum draw" or "Allocated draw"; nor how you should model devices with multiple PSUs (e.g. if the device average draw is 70W, should you allocate 35W on PSU1 and 35W on PSU2? Of course, if one PSU or feed fails, then you'll get 70W via the other one)
Therefore, I'm attempting to reverse engineer the logic. I'm posting it here so people can comment and/or correct it.
Power Ports
The relevant model lives in netbox/dcim/models/device_components.py (with related constants in netbox/dcim/constants.py). Two methods in netbox/dcim/models/device_components.py do the work:

- get_downstream_powerports: on a PDU, it finds the PowerOutlets internally linked to this PowerPort, and then any downstream PowerPorts connected by cables. But AFAICS it only looks one level deep, i.e. it doesn't recurse through further PDUs.
- get_power_draw: performs a summary calculation on a (PDU) power port. It calls get_downstream_powerports and sums the maximum_draw and allocated_draw from these. It uses a django.db aggregate() with two Sum()s, so I think this is done using the values stored in SQL only. (I need to test: if any value is null, will the entire result be null?)
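On the null question: SQL's SUM() skips NULL rows, so a single null maximum_draw shouldn't null out the whole aggregate; the result is None only when every summed value is NULL (or there are no rows). A quick way to check from a NetBox shell (the device name here is made up):

```python
from django.db.models import Sum
from dcim.models import PowerPort

totals = PowerPort.objects.filter(device__name="TestServer").aggregate(
    max_total=Sum("maximum_draw"),         # NULL values are skipped by SQL SUM()
    allocated_total=Sum("allocated_draw"),  # None only if every row is NULL
)
print(totals)  # e.g. {'max_total': 500, 'allocated_total': 250}
```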
At this point, we can set up some simple test cases. I will ignore 3-phase for now. To start: a "TestPDU" device with one power port (IN) and 8 linked power outlets (OUT[1-8]).

Now look at the "TestPDU":
And "TestPDU2":
Hence on TestPDU:
What about the "Available" and "Utilization" columns? These are inserted in the template netbox/templates/dcim/device.html, and are only displayed if the powerport is connected to a powerfeed (powerfeed = powerport.connected_endpoints.0), and additionally this must have an attribute available_power, which is only an attribute of a powerfeed, not a power outlet. If so, the two columns displayed are powerfeed.available_power and {% utilization_graph utilization.allocated|percentage:powerfeed.available_power %}.
(also if there are three-phase legs, additional table rows are displayed for each leg)
Therefore, next we need to go into power feeds.
Power Feeds
A Power Feed has voltage, amperage and max_utilization attributes (netbox/dcim/models/power.py, with choices in netbox/dcim/choices.py). Its available_power is calculated as abs(self.voltage) * self.amperage * (self.max_utilization / 100), and then multiplied by 1.732 if three-phase; finally it is rounded to the nearest integer.
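As a sanity check, the calculation can be mirrored in a few lines of Python (this just restates the formula above; it isn't NetBox code):

```python
import math

def available_power(voltage, amperage, max_utilization, three_phase=False):
    # abs(voltage) * amperage * (max_utilization / 100), per the description above
    power = abs(voltage) * amperage * (max_utilization / 100)
    if three_phase:
        power *= math.sqrt(3)  # ~1.732
    return round(power)

print(available_power(230, 15, 80))  # 2760 (VA) - matches the test feed below
```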
A Power Port can be connected with a cable to a Power Feed, and the usage calculations on the Power Port take into account attributes of the powerfeed. Let's extend the test data:
Now we see on TESTPDU:
This has calculated 230 * 15 * 0.8 = 2760 VA as the maximum utilization, and the sum of devices connected to directly-connected PDUs as 70VA, hence 2.5% utilization (rounded from 2.536%).

We see the same when looking at the Power Feed itself:
Rack-level calculations
If the test-feed is associated with the test-rack, then the test-rack shows some power utilization:
But weirdly, it shows utilization at 2.0%, not 2.5%. It is calculated slightly differently, in the function get_power_utilization in class Rack (netbox/dcim/models/racks.py): it sums get_power_draw()['allocated'] over each PowerPort, divides by available_power_total, multiplies by 100, and rounds down to an integer: return int(allocated_draw / available_power_total * 100)
I think this is a bug: the display of "2.0%" implies that it has at least 1 decimal place of accuracy. In the other places, the percentage is calculated using percentage() in netbox/utilities/templatetags/helpers.py, which rounds to the nearest 0.1%. I will raise this separately (#12838).
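The discrepancy is easy to reproduce with the test figures above: the Rack method truncates, while (per the helper's behaviour described above) percentage() rounds to one decimal place.

```python
allocated_draw, available_power_total = 70, 2760

rack_figure = int(allocated_draw / available_power_total * 100)         # 2 (truncated)
helper_figure = round(allocated_draw / available_power_total * 100, 1)  # 2.5

print(rack_figure, helper_figure)  # 2 2.5
```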
If the test-feed is not associated with the rack, then the rack shows 0% power used:
(In principle, it could still sum the power port usage of all devices within that rack, even if they are connected to power feeds in other racks, or even if they are not connected to any power feed - but it doesn't)
As far as I can see, there is no power information displayed at Site or Location level. Therefore, any power feed which is not associated with a rack, does not have its usage counted, aggregated or displayed, except when looking at the Power Feed itself. (Even when you view a Power Panel, you see a list of the Power Feeds it contains, but no aggregate power information)
Best practices?
The question then arises as to how you should properly make use of these features, especially how to assign values to "allocated_draw" and "maximum_draw".
In general, the questions I want to answer from the data fall into two classes:
Specifically then, if you have a device with two power ports, which uses a nominal 500W, how should you model it?
(The above assumes that we're not interested in in-rush current at all, and rather that "maximum_draw" is intended to reflect the draw when a device is running on the minimum number of PSUs, after a failure of a PSU or a PDU feed)
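Following that interpretation, one consistent convention (my sketch, not NetBox behaviour) for filling in the two fields on each power port would be:

```python
def per_port_draw(total_draw, num_psus, min_psus):
    """total_draw: nominal device draw (W); num_psus: PSUs fitted;
    min_psus: fewest PSUs the device can run on after failures."""
    allocated = total_draw / num_psus  # load-balanced share in normal operation
    maximum = total_draw / min_psus    # share each surviving PSU carries worst-case
    return allocated, maximum

print(per_port_draw(500, 2, 1))  # (250.0, 500.0): two PSUs, can run on one
print(per_port_draw(500, 3, 2))  # (166.66..., 250.0): three PSUs, needs two
```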
Then the issue is how to report on this.
Firstly, since Netbox doesn't handle second-level PDUs, power-feed level accounting is unusable in that situation. Specific examples:
Secondly, there's no site-level reporting, which I would find the most useful.
AFAICS, the best approach for now is to write some custom reports:
If people have other experience of this or better suggestions for how to use it, I'd be very happy to hear them.