Skip to content

Conversation

clumsy
Copy link
Contributor

@clumsy clumsy commented Sep 25, 2025

Need to be able to control nofile and memory ulimits for the container for AWS Batch: https://docs.aws.amazon.com/batch/latest/APIReference/API_Ulimit.html

Test plan:
[x] Added a unit test
[x] Tested via

torchx run --scheduler aws_batch --workspace="" --scheduler_args 'queue=<queue-name>,ulimits="name=nofile,softLimit=65536,hardLimit=65536",priority=1' utils.sh -h g5.4xlarge --image alpine:latest -- ulimit -n

Produces:

...
          "mountPoints": [],
          "privileged": false,
          "ulimits": [
            {
              "hardLimit": 65536,
              "name": "nofile",
              "softLimit": 65536
            }
          ],
          "instanceType": "g5.4xlarge",
          "resourceRequirements": [
            {
              "value": "16",
              "type": "VCPU"
            },
            {
              "value": "62912",
              "type": "MEMORY"
            },
            {
              "value": "1",
              "type": "GPU"
            }
          ],
          "linuxParameters": {
            "devices": [],
            "sharedMemorySize": 62912,
            "tmpfs": []
          },
...

@meta-cla meta-cla bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Sep 25, 2025
@clumsy
Copy link
Contributor Author

clumsy commented Sep 25, 2025

For your consideration, @kiukchung @andywag @d4l3k

@kiukchung
Copy link
Contributor

kiukchung commented Sep 25, 2025

@clumsy is the motivation here to increase or decrease ulimit -n?

@clumsy
Copy link
Contributor Author

clumsy commented Sep 25, 2025

That's correct, @kiukchung. Beyond the default value

@kiukchung
Copy link
Contributor

That's correct, @kiukchung. Beyond the default value

What happens if we just make ulimit default to 65536 instead of adding a scheduler option?

@clumsy
Copy link
Contributor Author

clumsy commented Sep 25, 2025

The default for Batch is usually 65536 (hard) and 32768 (soft) and it's possible to override it but only by using privileged users, non-root users can't change it later on, @kiukchung :

ulimit -a

real-time non-blocking time  (microseconds, -R) unlimited
core file size              (blocks, -c) unlimited
data seg size               (kbytes, -d) unlimited
scheduling priority                 (-e) 0
file size                   (blocks, -f) unlimited
pending signals                     (-i) 30446
max locked memory           (kbytes, -l) unlimited
max memory size             (kbytes, -m) unlimited
open files                          (-n) 65534
pipe size                (512 bytes, -p) 8
POSIX message queues         (bytes, -q) 819200
real-time priority                  (-r) 0
stack size                  (kbytes, -s) 10240
cpu time                   (seconds, -t) unlimited
max user processes                  (-u) unlimited
virtual memory              (kbytes, -v) unlimited
file locks                          (-x) unlimited

ulimit -n 100000

bash: ulimit: open files: cannot modify limit: Operation not permitted

but later
sudo su -
ulimit -n 100000
ulimit -a

real-time non-blocking time  (microseconds, -R) unlimited
core file size              (blocks, -c) unlimited
data seg size               (kbytes, -d) unlimited
scheduling priority                 (-e) 0
file size                   (blocks, -f) unlimited
pending signals                     (-i) 30446
max locked memory           (kbytes, -l) unlimited
max memory size             (kbytes, -m) unlimited
open files                          (-n) 100000
pipe size                (512 bytes, -p) 8
POSIX message queues         (bytes, -q) 819200
real-time priority                  (-r) 0
stack size                  (kbytes, -s) 10240
cpu time                   (seconds, -t) unlimited
max user processes                  (-u) unlimited
virtual memory              (kbytes, -v) unlimited
file locks                          (-x) unlimited

@codecov-commenter
Copy link

codecov-commenter commented Sep 25, 2025

⚠️ Please install the 'codecov app svg image' to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

❌ Patch coverage is 93.33333% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 91.63%. Comparing base (b72ba03) to head (f299798).
⚠️ Report is 1 commits behind head on main.

Files with missing lines Patch % Lines
torchx/schedulers/aws_batch_scheduler.py 93.33% 1 Missing ⚠️
❗ Your organization needs to install the Codecov GitHub app to enable full functionality.
Additional details and impacted files
@@           Coverage Diff           @@
##             main    #1127   +/-   ##
=======================================
  Coverage   91.63%   91.63%           
=======================================
  Files          83       83           
  Lines        6392     6409   +17     
=======================================
+ Hits         5857     5873   +16     
- Misses        535      536    +1     
Flag Coverage Δ
unittests 91.63% <93.33%> (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@clumsy clumsy force-pushed the feat/aws_batch_ulimits branch from 01ef2f6 to b96a1e7 Compare September 25, 2025 22:06
@clumsy clumsy force-pushed the feat/aws_batch_ulimits branch from b96a1e7 to 4fd830b Compare September 25, 2025 23:43
@kiukchung
Copy link
Contributor

@clumsy could you take a look at the failing pyre checks?

torchx/schedulers/aws_batch_scheduler.py:551:42 Incompatible parameter type [6]: In call `parse_ulimits`, for 1st positional argument, expected `List[str]` but got `Union[List[Variable[_T]], str]`.
torchx/schedulers/test/aws_batch_scheduler_test.py:414:12 Undefined attribute [16]: `object` has no attribute `__getitem__`.

don't worry about the docs build.

@clumsy clumsy force-pushed the feat/aws_batch_ulimits branch from 4fd830b to f299798 Compare September 30, 2025 17:20
@clumsy
Copy link
Contributor Author

clumsy commented Sep 30, 2025

Fixed the warnings @kiukchung. I changed the return type of _node_properties from Dict[str, object] to Dict[str, Any] to fix 3 other Pyre warnings/ignores.

I'm not sure what was the reason to use an opaque object type in the return type. Also Pyre can't handle list apparently so I had to change it to List in Opts so that it can verify it has __get_item__ method.

@kiukchung
Copy link
Contributor

Fixed the warnings @kiukchung. I changed the return type of _node_properties from Dict[str, object] to Dict[str, Any] to fix 3 other Pyre warnings/ignores.

I'm not sure what was the reason to use an opaque object type in the return type. Also Pyre can't handle list apparently so I had to change it to List in Opts so that it can verify it has __get_item__ method.

object is narrower than Any. Hmmm the list vs List is weird... might have to do with the way the class runopt works.

@clumsy
Copy link
Contributor Author

clumsy commented Sep 30, 2025

Fixed the warnings @kiukchung. I changed the return type of _node_properties from Dict[str, object] to Dict[str, Any] to fix 3 other Pyre warnings/ignores.
I'm not sure what was the reason to use an opaque object type in the return type. Also Pyre can't handle list apparently so I had to change it to List in Opts so that it can verify it has __get_item__ method.

object is narrower than Any. Hmmm the list vs List is weird... might have to do with the way the class runopt works.

It is narrower, just not clear why it's used for this return type. The code is now equivalent but with no warnings/errors and ignores. Let me know if you want me to revert of the changes, @kiukchung

@kiukchung kiukchung merged commit fd5c7d4 into meta-pytorch:main Oct 1, 2025
18 of 19 checks passed
@clumsy clumsy deleted the feat/aws_batch_ulimits branch October 1, 2025 22:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants