
Bug 1375350 Edits to OOR info #3023

Merged
bfallonf merged 1 commit into openshift:master from bfallonf:swap-1375350
Nov 8, 2016

Conversation

@bfallonf:

As per: https://bugzilla.redhat.com/show_bug.cgi?id=1375350

But also, I went through all of #2690 because I wanted to get to some of the vague bits.

@derekwaynecarr I'll put some comments in the PR. Can I get your thoughts? And of course if you have any comments on my changes.

Thanks!

cc: @ahardin-rh

Author:

Would it be easier to say "The pod will fail"? I feel like otherwise, we'd need to say "the *PodPhase* is transitioned to Failed in the X file".

Member:

a node does not lose resources, so i think that phrasing is awkward. maybe, in cases where a node is running low on available resources...

The upstream documentation I wrote (which this copied) was crisp on what it meant to fail a pod as it caused confusion. I would prefer we keep that description.

Author:

I do think this could be changed to something like "How the node signals that it's almost full" (though that's a terrible suggestion). But really, the paragraph below seems to be describing a setting that says that. Any other suggestions?

Member:

I would prefer to not change. The list of signals will grow, and fullness is not a term used in this area.

Author:

"The node can support the ability"

Does this mean it is not enabled by default? Is this what we're enabling here?

Member:

in 3.3, they are not enabled by default.

in 3.4, we will set some default values for memory.

in 3.5, we will set some default values for disk related resources.

Author:

Changed to "the node can be configured to..." because 3.3 is the current release.

Author:

Where are these signals actually given out? Is this in the logs? Which files exactly?

Member:

the signals are calculated from the summary stats api on the node.

the user can invoke that api by calling:

`curl <certificate details> https://<master>/api/v1/nodes/<node>/proxy/stats/summary`

the right hand side of these equations are literally taken from those values in the API response above.
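To make that concrete, here is a small sketch of how the `memory.available` signal maps onto the API response. The field names (such as `node.memory.availableBytes`) follow the Kubernetes summary stats schema as I understand it, and the sample JSON is hand-written for illustration; on a live node the JSON would come from the curl call above.

```shell
# Illustrative only: derive the memory.available signal from a sample
# summary stats response rather than a live node.
cat > /tmp/summary.json <<'EOF'
{"node": {"memory": {"availableBytes": 524288000, "workingSetBytes": 1073741824}}}
EOF
python3 - <<'PYEOF'
import json

stats = json.load(open('/tmp/summary.json'))
avail = stats['node']['memory']['availableBytes']
# Express the signal as a quantity in Mi (1Mi = 1048576 bytes)
print(f"memory.available = {avail // (1024 * 1024)}Mi")
PYEOF
```

With the sample values above this prints `memory.available = 500Mi`, which is the form the threshold comparisons operate on.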

Author (@bfallonf, Oct 11, 2016):

I think what's missing is an explanation of what an eviction threshold is. Why would you want to use a soft one over a hard one?

Member:

an example for a soft eviction and hard eviction threshold would be the following:

  • operator wants to evict immediately if available memory falls below 5% (hard threshold)
  • operator wants to evict if available memory falls below 30% for 1 minute (soft threshold)

in this scenario, the operator would like their machines to steady at 70% utilization, but are willing to go above that for short periods of time. that is the general idea. our ops team will typically use two thresholds for a similar reason.
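The two-threshold setup described here could be sketched as a `node-config.yaml` fragment along the following lines. This is a sketch only: 3.3 accepts literal quantities rather than percentages, so absolute values stand in for the 5%/30% figures above, and the exact numbers are illustrative.

```shell
# Hypothetical node-config.yaml kubeletArguments fragment pairing a hard and
# a soft memory eviction threshold; all values are illustrative stand-ins.
cat <<'EOF' > node-config-eviction.yaml
kubeletArguments:
  eviction-hard:
  - "memory.available<500Mi"        # evict immediately below this
  eviction-soft:
  - "memory.available<3Gi"          # evict if below this...
  eviction-soft-grace-period:
  - "memory.available=1m"           # ...sustained for one minute
EOF
```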

Author:

With the below, and this is said a few times: we're saying "The node supports the ability to X" a lot. I take that to mean something needs to be enabled in order to use it. Is that the case? If it's on by default, it should just say "The node can X".

Member:

we have no default eviction thresholds enabled in 3.3, so a user today must opt-in to this behavior. Meaning they need to set up the node to do this now. we will get defaults in the future.

Author:

Where is this found? A file somewhere?

Member:

this is basically describing a syntax for a threshold.

in 3.3, we just support literal thresholds (`memory.available<100Mi`)

in 3.4, we will support percentage thresholds as well (`memory.available<10%`)

so in 3.4, we will need to update this doc to reflect that both input styles are valid.

a sample node configuration is given in the "Example scenario" section.
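As a concrete reading of that syntax, a toy parser (purely illustrative, not the kubelet's actual implementation) splits a threshold expression into its signal, operator, and quantity parts:

```shell
# Toy breakdown of the <signal><operator><quantity> threshold syntax;
# "<" is the only operator supported in 3.3.
python3 - <<'PYEOF'
import re

def parse_threshold(expr):
    # signal: dotted lowercase name; quantity: number with optional suffix
    m = re.match(r'^([a-z.]+)(<)(\d+(?:\.\d+)?(?:Ki|Mi|Gi|%)?)$', expr)
    if not m:
        raise ValueError(f"unrecognized threshold: {expr}")
    return m.groups()  # (signal, operator, quantity)

print(parse_threshold("memory.available<100Mi"))
PYEOF
```

This prints `('memory.available', '<', '100Mi')`, and in 3.4 the same shape would also accept a percentage quantity.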

Author:

How does this tie into the above? There's nothing to really explain that at all.

Member:

it's meant to mean the signals in table 1 (which now lists one), but in 3.4 it needs to list 4 more (all disk related)

Author:

What does that mean? Is that the only valid value for the bit?

Member:

right now, < is the only supported operator, but there was discussion on potentially offering more. i kept the doc this way to allow it to grow without major re-writes.

Author:

How do you find that? Quantity of what?

Member:

this basically means 'the syntax you use to express a quantity anywhere in openshift/kube' - so whether it's how you declare the amount of cpu or memory for a pod, a constraint on quota, a limit in a limit range, etc. they all use the quantity representation. maybe it's just obvious. we don't appear to have a good doc to describe it:
https://github.com/kubernetes/kubernetes/blob/master/docs/design/resources.md#resource-quantities
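For illustration, the binary suffixes that quantities use for memory can be expanded like so. This is a sketch of the common Ki/Mi/Gi cases only, not the full Kubernetes quantity grammar (which also covers decimal suffixes, exponents, and milli-units for CPU):

```shell
python3 - <<'PYEOF'
# Expand the common binary suffixes used by Kubernetes quantities;
# the full grammar is larger than this sketch.
SUFFIXES = {"Ki": 1024, "Mi": 1024**2, "Gi": 1024**3}

def to_bytes(quantity):
    for suffix, mult in SUFFIXES.items():
        if quantity.endswith(suffix):
            return int(quantity[:-len(suffix)]) * mult
    return int(quantity)  # a bare integer is plain bytes

print(to_bytes("100Mi"))  # 104857600
PYEOF
```

So a threshold like `memory.available<100Mi` is comparing the signal against 104857600 bytes.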

Author:

OK. I get you. I made a link out to the docs above, but you're right, it might be a good idea for us on docs to document this at some point.

Author:

With the below, I think an example would be a lot better... @derekwaynecarr Do you have one we could put into the docs?

Member:

can we hold on a soft eviction scenario until we have disk? i think that will make more sense in that context for 3.4.

Author:

Where is the "Housekeeping" interval?

Member:

As noted earlier, I think we should just say:

"The node evaluates eviction thresholds every 10s."

In the future, hard eviction thresholds for memory will not use polling every 10s, and instead we will have the kernel tell us the threshold has been passed and act immediately. That is planned for Kubernetes 1.5 / Origin 1.5.

Member:

so drop any mention of housekeeping interval.

Author:

I did some minor rewrites so that it's specified that 10 seconds is the housekeeping interval.

Author:

Scratch that, after your next couple comments I scrapped it all instead.

Author:

Is cAdvisor something supported by OpenShift? I've not heard of it before. Is there a better place we could link out to?

Member:

we should drop housekeeping-interval from this document. it's hard-coded, and was included here in error.

Author:

I'm not sure what this is saying. Is it saying that the scheduler can read the eviction signal and do something accordingly? If so, I'd move this above to the signal section.

Member:

I would not move this to the signal section.

The list of reported Node conditions will grow in 3.4 to include DiskPressure.

This table is saying the following:

The scheduler looks at the NodeConditions reported by the node, and if it sees the node reporting "MemoryPressure" it will not place BestEffort pods on that node.

In 3.4, if the scheduler sees nodes that report "DiskPressure", it will not schedule any pods to that node.

The list of pressure conditions will grow, and the scheduler will do something slightly different for each.

Member:

so the scheduler does NOT read eviction signals, it reads Node Conditions that are driven based on the configured eviction thresholds. So for example, if i set an eviction threshold like the following:

eviction-hard is "memory.available<500Mi"

if available memory falls below that value, the Node has a value reported in Node.Status.Conditions[] whose Type will be MemoryPressure and whose Status will be True. It's that value on the node object that the scheduler integrates with when making scheduling decisions.
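A small sketch of what the scheduler consumes: filtering a node status for the MemoryPressure condition. On a live cluster the JSON would come from something like `oc get node <node> -o json`; the sample below is hand-written for illustration, including the reason string.

```shell
# Sample Node.Status.Conditions[] payload, hand-written for illustration.
cat > /tmp/node-status.json <<'EOF'
{"status": {"conditions": [
  {"type": "Ready", "status": "True"},
  {"type": "MemoryPressure", "status": "True", "reason": "KubeletHasInsufficientMemory"}
]}}
EOF
python3 - <<'PYEOF'
import json

node = json.load(open('/tmp/node-status.json'))
for cond in node['status']['conditions']:
    if cond['type'] == 'MemoryPressure':
        # When this is True, the scheduler stops placing BestEffort pods here
        print(f"MemoryPressure={cond['status']}")
PYEOF
```

With the sample above this prints `MemoryPressure=True`, which is the value the scheduler integrates with rather than the raw eviction signal.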

Author:

@derekwaynecarr This seems odd. It's as though having it enabled is a bad thing. Is there a reason why it's not just disabled by default instead of the user needing to do this?

Member:

It is a bad thing, but some customers want to have it enabled because they used that feature to meet certain densities in v2. unfortunately, swap being enabled means you can't use other features.

@sdodson -- would it be bad for us to disable swap by default in our install, and instead write doc to discuss how it could be turned back on if desired? is that something we can look to do in 3.4/3.5?

Member (@sdodson, Oct 18, 2016):

Yeah, it's not hard, just consensus building. We'll try to get it in the first update after 3.4. https://trello.com/c/vGmZYJ79/296-disable-swap-at-install-and-upgrade

Author:

Sounds good. The initial BZ is about this contradiction, so I'll check if that's enough for Eric ( @TheDiemer ) and continue with the rest of the comments. Thanks, all.

@bfallonf (Author):

@derekwaynecarr ^ bump

Member (@derekwaynecarr) left a comment:

Thanks for improving the documentation, please address the comments.

Member:

i would stick with low.





Member:

basically, I am trying to explain that the only rational thing to do when a node is running only guaranteed pods but system services are consuming too much resource is to fail the guaranteed pods, since I can't really fail node system services.



Member:

s/it/it's

I think this is so important that it should be called out maybe in the top of the document with something like ensuring your node has been configured correctly.

Author:

Agree. I moved this up into the Overview in an admonition.


@bfallonf (Author):

@derekwaynecarr Thanks for taking a look. I've made edits, pretty much to what you suggested. Can I get a final ack there's nothing else before I move forward with this? Thanks, again.

@bfallonf (Author):

@derekwaynecarr ^ Bump. (Thanks!)

@derekwaynecarr (Member):

@bfallonf -- this is a big improvement. LGTM

@bfallonf (Author):

Big thanks @derekwaynecarr ! Good to see it's an improvement. Can I ask you to please approve the changes? Seems that new GitHub review feature means people need to give the thumbs up by clicking a button.

@adellape @ahardin-rh Any comments before I merge?

@derekwaynecarr (Member):

@bfallonf -- approved changes, if we can merge this today I can send my updates for the disk-eviction support in 1.4 origin release.

@bfallonf (Author):

bfallonf commented Nov 1, 2016

@derekwaynecarr Sure thing. Thanks much. I'll get this merged. If this has been reviewed before tomorrow morning, I can maybe get someone in BNE to take a look.

cc @adellape @ahardin-rh

Contributor:

With the updated doc guidelines, we can apply the new formatting here:

`PodPhase`

Contributor:

Apply updated formatting here:

`Node.Status.Conditions`
`MemoryPressure`

Contributor:

`MemoryPressure`

Contributor:

`oom_score_adj`

Contributor (@ahardin-rh, Nov 7, 2016):

 `oom_score_adj`

There are more instances across the page.

Contributor:

`BestEffort`

@ahardin-rh (Contributor):

@bfallonf just a few minor comments from me regarding updated style guidelines. ⭐

@bfallonf (Author):

bfallonf commented Nov 8, 2016

Thanks @ahardin-rh . I'll pay more attention to the new guidelines... I'll merge away!

@bfallonf bfallonf merged commit 611370a into openshift:master Nov 8, 2016
@bfallonf bfallonf deleted the swap-1375350 branch November 8, 2016 04:54
@ahardin-rh ahardin-rh modified the milestones: Next Release, Staging, Published - 11/14/16 Nov 14, 2016