Bug 1375350: Edits to OOR info #3023
Conversation
Would it be easier to say "The pod will fail"? I feel like otherwise, we'd need to say "the *PodPhase* is transitioned to Failed in the X file".
A node does not lose resources, so I think that phrasing is awkward. Maybe: "in cases where a node is running low on available resources...".
The upstream documentation I wrote (which this copied) was deliberately crisp about what it means to fail a pod, as that caused confusion. I would prefer we keep that description.
I do think this could be changed to something like "How the node signals that it's almost full", (though that's a terrible suggestion). But really, the paragraph below seems to be describing a setting that says that. Any other suggestions?
I would prefer to not change. The list of signals will grow, and fullness is not a term used in this area.
"The node can support the ability"
Does this mean it is not enabled by default? Is this what we're enabling here?
In 3.3, they are not enabled by default.
In 3.4, we will set some default values for memory.
In 3.5, we will set some default values for disk-related resources.
Changed to "the node can be configured to..." because 3.3 is the current release.
Where are these signals actually given out? Is this in the logs? Which files exactly?
The signals are calculated from the summary stats API on the node.
The user can invoke that API by calling:
curl <certificate details> https://<master>/api/v1/nodes/<node>/proxy/stats/summary
The right-hand side of these equations is taken literally from the values in that API response.
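To make that concrete, here is a minimal sketch (not product code) of reading the `memory.available` value out of a summary stats response. The JSON shape follows the Kubernetes summary API, but treat the exact field names as an assumption for this example, and the sample numbers are invented:

```python
import json

# Invented sample response; field names follow the Kubernetes summary
# stats API ("node" -> "memory"), but the exact shape is an assumption.
sample_response = json.loads("""
{
  "node": {
    "memory": {
      "availableBytes": 524288000,
      "workingSetBytes": 1572864000
    }
  }
}
""")

def memory_available_bytes(summary):
    # This is the value the node compares against memory.available thresholds.
    return summary["node"]["memory"]["availableBytes"]

threshold_bytes = 100 * 1024 * 1024  # e.g. eviction-hard "memory.available<100Mi"
print(memory_available_bytes(sample_response))                    # 524288000
print(memory_available_bytes(sample_response) < threshold_bytes)  # False
```

With 500Mi available and a 100Mi threshold, the threshold is not crossed, so no eviction would be triggered in this sample.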
I think what's missing is an explanation of what an eviction threshold is. Why would you want to use a soft one over a hard one?
An example of a hard and a soft eviction threshold would be the following:
- the operator wants to evict immediately if available memory falls below 5% (hard threshold)
- the operator wants to evict if available memory stays below 30% for 1 minute (soft threshold)
In this scenario, the operator would like their machines to hold steady at 70% utilization, but is willing to go above that for short periods of time. That is the general idea. Our ops team will typically use two thresholds for a similar reason.
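A hard/soft pairing like this is expressed in the node configuration. The following is a hypothetical node-config.yaml fragment with illustrative values (not product defaults); the kubelet argument names match the eviction settings discussed in this thread:

```yaml
# Hypothetical example only -- values are illustrative, not defaults.
kubeletArguments:
  eviction-hard:
    - "memory.available<100Mi"
  eviction-soft:
    - "memory.available<300Mi"
  eviction-soft-grace-period:
    - "memory.available=1m"
```

The soft threshold only triggers eviction after the condition has held for the associated grace period, while the hard threshold acts immediately.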
With the below, and this is said a few times: we're saying "The node supports the ability to X" a lot. I take that to mean something needs to be enabled in order to use it. Is that the case? If it's on by default, it should just say "The node can X".
We have no default eviction thresholds enabled in 3.3, so a user today must opt in to this behavior, meaning they need to set up the node to do this now. We will get defaults in the future.
Where is this found? A file somewhere?
This is basically describing the syntax for a threshold.
In 3.3, we just support literal thresholds (memory.available<100Mi).
In 3.4, we will support percentage thresholds as well (memory.available<10%).
So in 3.4, we will need to update this doc to reflect that both input styles are valid.
A sample node configuration is given in the "Example scenario" section.
How does this tie into the above? There's nothing to really explain that at all.
It's meant to mean the signals in table 1 (which now lists one), but in 3.4 it needs to list four more (all disk-related).
What does that mean? Is that the only valid value for the bit?
Right now, < is the only supported operator, but there was discussion about potentially offering more. I kept the doc this way to allow it to grow without major rewrites.
How do you find that? Quantity of what?
This basically means "the syntax you use to express a quantity anywhere in OpenShift/Kubernetes". Whether it's how you declare the amount of CPU or memory for a pod, a constraint on a quota, or a limit in a limit range, they all use the quantity representation. Maybe it's just obvious. We don't appear to have a good doc to describe it:
https://github.com/kubernetes/kubernetes/blob/master/docs/design/resources.md#resource-quantities
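For illustration, here is a minimal sketch of what the quantity notation means for the memory thresholds discussed here. The function and suffix table are invented for this example; the real Kubernetes parser handles more forms (decimal-SI suffixes, exponent notation):

```python
# Hypothetical sketch of the Kubernetes "quantity" notation as used in an
# eviction threshold such as memory.available<100Mi. Only binary-SI
# suffixes are handled; this is not the real parser.
BINARY_SUFFIXES = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}

def quantity_to_bytes(quantity):
    """Convert a quantity string like '100Mi' to a number of bytes."""
    for suffix, factor in BINARY_SUFFIXES.items():
        if quantity.endswith(suffix):
            return int(quantity[:-len(suffix)]) * factor
    return int(quantity)  # plain byte count, e.g. "104857600"

print(quantity_to_bytes("100Mi"))  # 104857600
print(quantity_to_bytes("1Gi"))   # 1073741824
```

So "memory.available<100Mi" means "evict when fewer than 104857600 bytes of memory are available."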
OK, I get you. I made a link out to the docs above, but you're right, it might be a good idea for us on the docs team to document this at some point.
With the below, I think an example would be a lot better... @derekwaynecarr Do you have one we could put into the docs?
Can we hold off on a soft eviction scenario until we have disk? I think that will make more sense in that context for 3.4.
Where is the "Housekeeping" interval?
As noted earlier, I think we should just say:
"The node evaluates eviction thresholds every 10s."
In the future, hard eviction thresholds for memory will not use polling every 10s, and instead we will have the kernel tell us the threshold has been passed and act immediately. That is planned for Kubernetes 1.5 / Origin 1.5.
So drop any mention of the housekeeping interval.
I did some minor rewrites so that it's specified that 10 seconds is the housekeeping interval.
Scratch that, after your next couple comments I scrapped it all instead.
Is cAdvisor something supported by OpenShift? I've not heard of it before. Is there a better place we could link out to?
We should drop housekeeping-interval from this document. It's hard-coded, and was included here in error.
I'm not sure what this is saying. Is it saying that the scheduler can read the eviction signal and act accordingly? If so, I'd move this above to the signal section.
I would not move this to the signal section.
The list of reported Node conditions will grow in 3.4 to include DiskPressure.
This table is saying the following:
The scheduler looks at the NodeConditions reported by the node, and if it sees the node reporting "MemoryPressure" it will not place BestEffort pods on that node.
In 3.4, if the scheduler sees nodes that report "DiskPressure", it will not schedule any pods to that node.
The list of pressure conditions will grow, and the scheduler will do something slightly different for each.
So the scheduler does NOT read eviction signals; it reads node conditions that are driven by the configured eviction thresholds. For example, if I set an eviction threshold like the following:
eviction-hard is "memory.available<500Mi"
If available memory falls below that value, the node has a value reported in Node.Status.Conditions[] whose Type will be MemoryPressure and whose Status will be True. It's that value on the node object that the scheduler integrates with when making scheduling decisions.
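The scheduling behavior described in this thread could be sketched as follows. The function names and data shapes are illustrative only, not the real scheduler API:

```python
# Hypothetical sketch: the scheduler inspects Node.Status.Conditions[],
# not eviction signals directly. Names here are invented for illustration.

def has_condition(conditions, cond_type):
    """Return True if the node reports the condition with Status == "True"."""
    return any(c["Type"] == cond_type and c["Status"] == "True"
               for c in conditions)

def can_schedule(conditions, pod_is_best_effort):
    # 3.3 behavior: keep BestEffort pods off nodes reporting MemoryPressure.
    if pod_is_best_effort and has_condition(conditions, "MemoryPressure"):
        return False
    # Planned 3.4 behavior: keep all pods off nodes reporting DiskPressure.
    if has_condition(conditions, "DiskPressure"):
        return False
    return True

conditions = [{"Type": "MemoryPressure", "Status": "True"}]
print(can_schedule(conditions, pod_is_best_effort=True))   # False
print(can_schedule(conditions, pod_is_best_effort=False))  # True
```

This matches the table's description: MemoryPressure only blocks BestEffort pods, while DiskPressure (in 3.4) will block all pods.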
admin_guide/overcommit.adoc
Outdated
@derekwaynecarr This seems odd. It's as though having it enabled is a bad thing. Is there a reason why it's not just disabled by default instead of the user needing to do this?
It is a bad thing, but some customers want to have it enabled because they used that feature to meet certain densities in v2. Unfortunately, swap being enabled means you cannot use other features.
@sdodson -- would it be bad for us to disable swap by default in our install, and instead write doc to discuss how it could be turned back on if desired? Is that something we can look to do in 3.4/3.5?
Yeah, it's not hard, just consensus building. We'll try to get it in the first update after 3.4. https://trello.com/c/vGmZYJ79/296-disable-swap-at-install-and-upgrade
Sounds good. The initial BZ is about this contradiction, so I'll check if that's enough for Eric ( @TheDiemer ) and continue with the rest of the comments. Thanks, all.
@derekwaynecarr ^ bump
derekwaynecarr
left a comment
Thanks for improving the documentation, please address the comments.
Basically, I am trying to explain that the only rational thing to do when a node is running only guaranteed pods, but system services are consuming too much resource, is to fail the guaranteed pods, since I can't really fail node system services.
s/it/it's
I think this is so important that it should be called out, maybe at the top of the document, with something like ensuring your node has been configured correctly.
Agree. I moved this up into the Overview in an admonition.
@derekwaynecarr Thanks for taking a look. I've made edits, pretty much to what you suggested. Can I get a final ack that there's nothing else before I move forward with this? Thanks, again.
@derekwaynecarr ^ Bump. (Thanks!)
@bfallonf -- this is a big improvement. LGTM
Big thanks @derekwaynecarr! Good to see it's an improvement. Can I ask you to please approve the changes? It seems the new GitHub review feature means people need to give the thumbs-up by clicking a button. @adellape @ahardin-rh Any comments before I merge?
@bfallonf -- approved changes. If we can merge this today, I can send my updates for the disk-eviction support in the 1.4 Origin release.
@derekwaynecarr Sure thing. Thanks much. I'll get this merged. If this has been reviewed before tomorrow morning, I can maybe get someone in BNE to take a look.
With the updated doc guidelines, we can apply the new formatting here:
`PodPhase`
Apply updated formatting here:
`Node.Status.Conditions`
`MemoryPressure`
`oom_score_adj`
There are more instances across the page.
@bfallonf just a few minor comments from me regarding updated style guidelines. ⭐
Thanks @ahardin-rh. I'll pay more attention to the new guidelines... I'll merge away!
As per: https://bugzilla.redhat.com/show_bug.cgi?id=1375350
But also, I went through all of #2690 because I wanted to get to some of the vague bits.
@derekwaynecarr I'll put some comments in the PR. Can I get your thoughts? And of course if you have any comments on my changes.
Thanks!
cc: @ahardin-rh