
docs: add documentation about resource requests #2226

Merged: 7 commits into jupyterhub:main on May 28, 2021

Conversation

@consideRatio (Member) commented May 24, 2021

This documents the current situation on how to set resource requests and closes #2085.

  • Is the location of this documentation section okay?
  • Is there something missing from the documentation that is relevant to convey?
  • Is the language okay?

An action point list from #2085, made by @minrk and self-checked by me.

  • explain a bit of background about resource reservations and when they are useful
  • identify all components that might want resource requests
  • identify reasonable values for key pods (mainly hub and proxy) and ranges based on load (e.g. using mybinder.org as a reference point for high load)
    • I've provided reasonable recommendations for requests for a reliable configuration.
    • I've opted not to try to provide a recommendation for smaller deployments, because it is a sliding scale and I lack information about it.

@manics (Member) left a comment

Looking good! I made a first pass over it.

doc/source/administrator/optimization.md (resolved review comments)
Comment on lines 416 to 417
1. A container running out of memory is typically terminated and restarted as a
process is killed by a _Out Of Memory Killer_ (OOMKiller). When this happens
Member

Doesn't the restart of the pod only apply if it's part of a daemonset/deployment?

Member Author

Hmmmm, not sure if I understand your question, but...

  1. A pod managed by a replicaset or statefulset could be recreated, for example if it is evicted from a node that is about to undergo maintenance, or if another pod with higher priority evicts it in order to be scheduled on the node.
  2. A container can restart independently of the entire pod being recreated; in other words, this can also happen for a standalone pod. This is what I'm referring to here.

Member Author

A pod can have a restartPolicy, which describes how its containers should behave: should they restart on failure, on completion, etc.?

https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#restart-policy
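
As a minimal sketch of what that looks like for a standalone pod (names and values are illustrative, not from the PR):

apiVersion: v1
kind: Pod
metadata:
  name: restart-example
spec:
  # Applies to all containers in the pod: Always (the default), OnFailure, or Never
  restartPolicy: OnFailure
  containers:
    - name: container
      image: python:3.9-alpine
      # Exits with a failure code, so the kubelet restarts this container in place
      command: ["python3", "-c", "raise SystemExit(1)"]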

doc/source/administrator/optimization.md (resolved review comments)
@consideRatio mentioned this pull request May 25, 2021
@consideRatio added this to the 1.0.0 milestone May 25, 2021
@consideRatio (Member Author) commented May 25, 2021

@manics thanks for your feedback, I've tried addressing it now! What do you think?

Note that the unrelated CI failure is fixed in #2230.

@meeseeksmachine

This pull request has been mentioned on Jupyter Community Forum. There might be relevant details there:

https://discourse.jupyter.org/t/suggested-cpu-memory-resource-requests-limits/9322/1

@minrk (Member) left a comment

Agreed with @manics that this need not be a blocker for 1.0, but this is already great info as it is, and I'm happy for it to be merged right now and refined.

Tweaking individual numbers for best practices / reasonable limits is something that is always going to evolve over time, and what's here seems like a great start.

resource _limits_ close to or equal to the resource requests can be useful to
make the container operational performance more consistent.

1. Requested memory is reserved and unavailable for other containers on a node
Member

Where did you find this info about memory requests being unavailable? I thought if you had two pods that each request 50% and limit 100% of a node's memory, they would each be able to allocate 100% of the node memory as long as it's not at the same time. But if this is accurate, if requests add up to 100%, no pod will be able to exceed its request at all.

My understanding has been that 'request' is really only an input to the scheduler, while 'limit' is the only input to the runtime (OOMKiller).
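
A minimal sketch of that distinction, with the standard Kubernetes fields annotated (names and values are illustrative):

apiVersion: v1
kind: Pod
metadata:
  name: requests-vs-limits
spec:
  containers:
    - name: container
      image: python:3.9-alpine
      command:
        - sh
        - -c
        - while true; do sleep 30; done
      resources:
        requests:
          memory: 512Mi  # counted against the node's allocatable memory when the scheduler places the pod
        limits:
          memory: 1Gi    # enforced at runtime: exceeding this is what triggers the OOMKiller for the container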

Member Author

Oh oh hmmmmm, you are probably correct in your understanding.

I haven't questioned this understanding much; I think I formulated it myself based on how I figured it was unreasonable that a pod that has requested a certain amount of memory would have to wait for memory to be freed up while it is being used by another container. At the same time, that process could be very quick, and I bet the OOMKiller can take action before all memory is depleted as well.

Member

Yeah, I think requests are a lot more basic than folks often expect - I think it's purely scheduler bookkeeping, and has no actual effect once pods are scheduled (except possibly in OOMKiller priority when it's time for something to be killed).

Maybe worth just deleting this point then. Then 👍 to merge.

@minrk (Member) commented May 27, 2021

I tested with k3d:

pods.yaml
apiVersion: v1
kind: Pod
metadata:
  name: memory-eater
spec:
  containers:
    - name: container
      image: python:3.9-alpine
      env:
        - name: PYTHONUNBUFFERED
          value: "1"
      command:
        - python3
        - -c
        - |
          import os
          import time
          MB = 1024 * 1024
          chunk_size = 100 * MB  # 100 MB
          chunks = []

          while True:
            chunk = bytearray(chunk_size)
            chunks.append(chunk)
            print(f"{len(chunks) * chunk_size // MB }MB")
            time.sleep(5)
      resources:
        requests:
          memory: 900Mi
        limits:
          memory: 1800Mi

---
apiVersion: v1
kind: Pod
metadata:
  name: memory-squatter
spec:
  containers:
    - name: container
      image: python:3.9-alpine
      env:
        - name: PYTHONUNBUFFERED
          value: "1"
      command:
        - sh
        - -c
        - while true; do sleep 30; done
      resources:
        requests:
          memory: 900Mi
        limits:
          memory: 1800Mi

on a node with 2Gi, and the memory eater is allowed to comfortably allocate almost all of the node's memory before being killed. The presence or absence of the memory squatter's request for 900M doesn't have any effect on how much the eater is able to allocate before being killed.

If there is a point to be made here, I think it's that oversubscribing CPU results in things being slower than they could be, but usually not catastrophic, while oversubscribing memory results in hangs and crashes. The same actually goes for under-provisioning: low limits on CPU mean things might be slow, while low limits on memory mean things will keep getting killed.
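
As a companion sketch for the CPU side of that summary (names and values are illustrative; a container exceeding its CPU limit is throttled rather than killed):

apiVersion: v1
kind: Pod
metadata:
  name: cpu-eater
spec:
  containers:
    - name: container
      image: python:3.9-alpine
      command:
        - sh
        - -c
        # busy loop that would gladly consume a full core
        - while true; do true; done
      resources:
        requests:
          cpu: 100m  # counted by the scheduler when placing the pod
        limits:
          cpu: 500m  # enforced at runtime: usage above this is throttled, the container keeps running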

Member Author

@minrk thank you!!! This is an excellent summary!

Member Author

I relied on this insight and made another pass.

doc/source/administrator/optimization.md (resolved review comments)
Co-authored-by: Min RK <benjaminrk@gmail.com>
@manics (Member) commented May 26, 2021

How about taking out any text we're uncertain about, since it's better to say nothing than to give incorrect information, then merging? And refining later?

@consideRatio (Member Author) commented May 28, 2021

I made another pass. Let's make refinements in separate PRs and get this going!

In the pass I made:

  • Updated some descriptions to reflect new insights, correcting incorrect statements.
  • Updated resource request suggestions to be sensible assuming you want to fit the core pods onto a 2 CPU node, which I think is a reasonable aim for a z2jh deployment (a config sketch follows this list).
  • Added min/max ranges for resource usage in various Berkeley deployments across 6 months.
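
A minimal sketch of where such requests are set in a z2jh config.yaml, assuming the chart's hub.resources and proxy.chp.resources configuration keys (the numbers are illustrative placeholders, not the PR's recommendations):

hub:
  resources:
    requests:
      cpu: 500m      # illustrative only
      memory: 512Mi
proxy:
  chp:
    resources:
      requests:
        cpu: 500m    # illustrative only
        memory: 512Mi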

@consideRatio merged commit e9bbbe7 into jupyterhub:main May 28, 2021

Successfully merging this pull request may close these issues.

Document resource requests for core components
4 participants