
Add KEDA HPA TriggerAuthentication and postgresql ScaledObject. #2384

Closed
pt247 wants to merge 162 commits

Conversation

@pt247 (Contributor) commented Apr 8, 2024

Reference Issues or PRs

Fixes #2284

What does this implement/fix?

Adds KEDA HPA TriggerAuthentication and PostgreSQL ScaledObject.
In the current develop branch, one conda-store-worker pod is always running. With this PR, conda-store-worker pods scale down to zero at the start, scale up with the number of pending/queued builds, and scale back down to zero when builds are complete.
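
For reference, a minimal sketch of what the added TriggerAuthentication and PostgreSQL-triggered ScaledObject look like (the resource names, secret reference, and SQL query below are illustrative placeholders, not the exact manifests in this PR):

apiVersion: keda.sh/v1alpha1
kind: TriggerAuthentication
metadata:
  name: conda-store-postgres-auth            # illustrative name
spec:
  secretTargetRef:
    - parameter: password
      name: nebari-conda-store-postgresql    # illustrative secret holding the DB password
      key: postgresql-password
---
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: conda-store-worker-scaler            # illustrative name
spec:
  scaleTargetRef:
    name: nebari-conda-store-worker          # the worker deployment to scale
  minReplicaCount: 0                         # scale to zero when no builds are pending
  maxReplicaCount: 50                        # cap on concurrent workers (see discussion below)
  triggers:
    - type: postgresql
      metadata:
        host: nebari-conda-store-postgresql  # illustrative service name
        port: "5432"
        userName: postgres
        dbName: conda-store                  # illustrative database name
        sslmode: disable
        query: "SELECT COUNT(*) FROM build WHERE status IN ('QUEUED', 'BUILDING');"  # illustrative count of pending/queued builds
        targetQueryValue: "1"                # one worker per pending/queued build
      authenticationRef:
        name: conda-store-postgres-auth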

Put an x in the boxes that apply

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds a feature)
  • Breaking change (fix or feature that would cause existing features not to work as expected)
  • Documentation Update PR
  • Code style update (formatting, renaming)
  • Refactoring (no functional changes, no API changes)
  • Build related changes
  • Other (please describe):

Testing

  • Did you test the pull request locally?
  • Did you add new tests?: New local integration test added.

Any other comments?

  1. We can now set the following parameters while configuring conda-store:

     conda_store:
       max_workers: 50
       worker_resources:
         requests:
           cpu: 1
           memory: 4Gi

  2. The new local integration test is slow and adds 10-15 minutes to the local integration test run.
  3. We can add follow-up PRs to expose KEDA and KEDA.ScaledObject parameters if needed.

@pt247 pt247 requested a review from viniciusdc April 8, 2024 11:06
@pt247 pt247 marked this pull request as ready for review April 8, 2024 11:07
@pt247 pt247 requested a review from dcmcand April 8, 2024 11:20
@Adam-D-Lewis Adam-D-Lewis self-requested a review April 9, 2024 16:01
@Adam-D-Lewis (Member) left a comment

We should also set a memory request for the conda store worker deployment (I'd suggest a minimum of 1 GiB), and allow the general node group to scale above 1 node by default for each provider (e.g. set the max_nodes default value to 5). For example, the current Azure default is:

"general": AzureNodeGroup(instance="Standard_D8_v3", min_nodes=1, max_nodes=1),

Otherwise, if we try to build 100 conda envs simultaneously, we'll get 100 conda worker pods. They will all be scheduled on a single general node (assuming only 1 is running). The general node won't even try to scale up (because the conda store worker doesn't have any resource requests set), yet it won't have enough memory to solve all the envs at the same time, so many will fail.

We should also call this out in the next Nebari release notes, recommending that existing deployments increase the max_nodes setting in their nebari-config if it is explicitly set to 4 or less.

It also might be a good idea to set a maximum number of conda store workers as a safeguard (even if the max is high). E.g. if 100 envs are all scheduled at once, maybe we should only start a maximum of 50 workers (in the KEDA SQL query, take the min of the number of queued/building envs and 50, for example). Ideally that max would be configurable in the nebari-config under the conda_store section.
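
Concretely, the kind of nebari-config settings being suggested here might look like the following sketch (the node_groups layout and the values are illustrative, combining the provider defaults mentioned above with the conda_store options this PR adds):

azure:
  node_groups:
    general:
      instance: Standard_D8_v3
      min_nodes: 1
      max_nodes: 5          # let the general node group scale out when many workers are scheduled

conda_store:
  max_workers: 50           # safeguard on the number of concurrent conda-store worker pods
  worker_resources:
    requests:
      cpu: 1
      memory: 4Gi           # memory request so the scheduler spreads workers across nodes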

@dharhas (Member) commented Apr 9, 2024

Are there conda env solves that require more than 1 GB of RAM? I remember early on we had to keep bumping up the RAM on the conda-store main node.

@Adam-D-Lewis (Member) commented Apr 9, 2024

@dharhas

Are there conda env solves that require more than 1 GB of ram? I remember early on we had to keep bumping up the ram on the conda-store main node.

Yes, there are certainly conda solves (most?) that use more than 1 GB of RAM. I don't think there's a way to estimate beforehand how much RAM a conda solve will need. Setting the memory resource request doesn't prevent the pod from using more RAM if it's available on the node, but it does put a limit on the number of conda store workers scheduled on the same node.

@pt247 (Contributor, Author) commented Apr 9, 2024

It also might be a good idea to set a maximum number of conda store workers as a safe guard (even if the max number is high). E.g. If 100 envs are all scheduled at once, maybe we should only start a maximum of 50 workers. (In the KEDA sql command maybe take the min of the number of queued/building envs and 50 for example). Ideally that max would be configurable in the nebari config under the conda_store section.

The default, as per the documentation (link), is maxReplicaCount: 100.
@Adam-D-Lewis I will change it to 50 as suggested.

@pt247 (Contributor, Author) commented Apr 15, 2024

I have tried to create 100 conda environments for testing, and not all of them transition to a completed state. Some 24 get stuck in a queued state with the following error.

[2024-04-15 00:08:26,319: WARNING/Beat]   File "/opt/conda/envs/conda-store-server/lib/python3.10/dbm/dumb.py", line 316, in open return _Database(file, mode, flag=flag)                                                                                                                   
[2024-04-15 00:08:26,320: WARNING/Beat]   File "/opt/conda/envs/conda-store-server/lib/python3.10/dbm/dumb.py", line 70, in __init__ self._create(flag)       
[2024-04-15 00:08:26,320: WARNING/Beat]   File "/opt/conda/envs/conda-store-server/lib/python3.10/dbm/dumb.py", line 86, in _create with _io.open(self._datfile, 'w', encoding="Latin-1") as f:                                                                                               
[2024-04-15 00:08:26,320: WARNING/Beat]   FileNotFoundError: [Errno 2] No such file or directory: '/root/.conda-store/celerybeat-schedule.dat'    

@dcmcand In this case, should we roll this out based on the manual testing we have done so far?

@Adam-D-Lewis Are you happy with the other changes?

@pt247 pt247 requested a review from Adam-D-Lewis April 15, 2024 08:03
@Adam-D-Lewis (Member) commented Apr 15, 2024

We should also set a memory request for the conda store worker deployment (I'd suggest a minimum of 1GiB)

@pt247 looks like we still need this part. This could be one of the reasons the conda store workers are failing.

[screenshot: suggested resource request lines in worker.tf]
Add lines 116-121 in src/_nebari/stages/kubernetes_services/template/modules/kubernetes/services/conda-store/worker.tf similar to what I've shown here.
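
Since the screenshot isn't reproduced in this text export, here is a rough sketch of the kind of resource request being asked for, expressed as the Kubernetes-level container spec the worker.tf change should render (the container name and values are illustrative suggestions, not the screenshot's exact contents):

containers:
  - name: conda-store-worker      # illustrative container name
    resources:
      requests:
        cpu: "1"                  # illustrative CPU request
        memory: 1Gi               # at least 1 GiB memory request, per the comment above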

@dcmcand (Contributor) commented Apr 16, 2024

(Quoted @pt247's comment above about ~24 of the 100 test environments getting stuck in the queued state, including the FileNotFoundError logs.)
@pt247 After adding the changes that @Adam-D-Lewis requested, can you redo your load test? We need to be able to gracefully scale conda-store workers. Ensuring that we have guidance around what resources/configs are needed for what scale would be important too. Finally, please also open a docs PR to document how to configure this.

@pt247 pt247 requested a review from Adam-D-Lewis June 6, 2024 08:03
@Adam-D-Lewis (Member) left a comment (marked as duplicate)

I don't think you've addressed any of my comments unless I'm missing something.

@Adam-D-Lewis (Member) commented Jun 6, 2024

Also, I'm running off a modified branch starting from yours, so there could be some differences, but it appears that the conda-store workers won't actually run on more than 1 node because the PVC they are using is ReadWriteOnce. They just stay Pending even if another node exists for them to run on.
[screenshots: conda-store worker pods stuck in Pending]

I'm not familiar enough with how conda store works (e.g. how the environment gets from the conda store worker storage to the conda store NFS drive), so I'm not sure what the best way to solve this is.
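
For context on the ReadWriteOnce limitation, a minimal sketch of the kind of PVC involved (the name, size, and access-mode comments are illustrative, not the actual resource in the conda-store module):

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: conda-store-storage       # illustrative name
spec:
  accessModes:
    - ReadWriteOnce               # only pods on the node that mounts the volume can use it,
                                  # so extra workers scheduled on other nodes stay Pending
    # - ReadWriteMany             # would allow workers on multiple nodes, but requires a
                                  # storage class that supports it (e.g. an NFS-backed one)
  resources:
    requests:
      storage: 200Gi              # illustrative size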

@Adam-D-Lewis (Member) commented:

I've documented the scaling issue here.

@Adam-D-Lewis (Member) commented Jun 11, 2024

@pt247 For this PR, I suggest we just use KEDA to only scale up to 4 workers as we do currently, but also have the conda-store workers scale down to 0 when not in use to relieve the memory leak behavior seen here. We need to test (manually) that conda envs are still mounted to the user pod when no conda store workers are running. I'm not sure if that'll behave properly if the nfs server isn't running.

@pt247 (Contributor, Author) commented Jun 11, 2024

@Adam-D-Lewis

@pt247 For this PR, I suggest we just use KEDA to only scale up to 4 workers as we do currently, but also have the conda-store workers scale down to 0 when not in use to relieve the memory leak behavior seen here. We need to test (manually) that conda envs are still mounted to the user pod when no conda store workers are running. I'm not sure if that'll behave properly if the nfs server isn't running.

While doing a fresh deploy, I came across a worker that won't scale down; it seems to be trying to build one of the shared environments:
[Screenshot 2024-06-11 at 19 33 14]

Logs show this warning:

nebari-conda-store-worker-5f5579fd9d-lgb9l conda-store-worker [2024-06-11 17:43:12,543: WARNING/ForkPoolWorker-2] /opt/conda/envs/conda-store-server/lib/python3.10/site-packages/conda_store_server/build.py:353: UserWarning: Generating Docker images is currently not supported, see https://github.com/conda-incubator/conda-store/issues/666
nebari-conda-store-worker-5f5579fd9d-lgb9l conda-store-worker   warnings.warn(
nebari-conda-store-worker-5f5579fd9d-lgb9l conda-store-worker
nebari-conda-store-worker-5f5579fd9d-lgb9l conda-store-worker [2024-06-11 17:43:12,544: INFO/ForkPoolWorker-2] Task task_build_conda_docker[build-2-docker] succeeded in 0.015599027999996906s: None
nebari-conda-store-worker-5f5579fd9d-lgb9l conda-store-worker [2024-06-11 17:43:15,151: INFO/ForkPoolWorker-2] Task task_build_conda_env_export[build-2-conda-env-export] succeeded in 2.6055338719997962s: None
nebari-conda-store-worker-5f5579fd9d-lgb9l conda-store-worker [2024-06-11 17:43:15,151: INFO/MainProcess] Task task_update_storage_metrics[build-2] received
nebari-conda-store-worker-5f5579fd9d-lgb9l conda-store-worker [2024-06-11 17:43:15,159: INFO/ForkPoolWorker-2] Task task_update_storage_metrics[build-2] succeeded in 0.006750864999958139s: None

It looks like KEDA won't help in this case, as the build never finishes.

@pt247 pt247 requested a review from Adam-D-Lewis June 11, 2024 19:06
@pt247 (Contributor, Author) commented Jun 11, 2024

@Adam-D-Lewis I have reduced the default value of max conda-store workers to 4. Users can still increase this number from the config. I can add a docs FAQ about it once this PR is in. Alternatively, I can simply remove that configuration and hard-code it to 4. I would prefer the user had the option. What do you think?

@Adam-D-Lewis (Member) commented:

(Quoted @pt247's comment above about the worker that won't scale down, including the build logs.)

Are you sure the issue with a build never finishing is related? I can't see how it's related. What I'd like to see is: spin down all the conda store workers after having built some envs, then start up a new Jupyter user pod and run something using one of the built environments (while no conda store workers are running).

@Adam-D-Lewis (Member) commented:

Update from our meeting: looks like we can't scale down to 0 conda store workers because the NFS server is in the same pod as the conda store worker container, and the Jupyter user pods will fail to run if the conda store NFS server isn't running.

@pt247 (Contributor, Author) commented Jun 12, 2024

It turns out we cannot scale conda-store workers to zero because the worker pod also hosts the NFS share that is needed for any Jupyter notebook to start. Keeping at least one worker alive could be a solution. Additionally, we cannot scale conda-store workers beyond a single node, as the NFS share is ReadWriteOnce and tied to a single PVC. As there is no advantage in implementing this PR, I am closing it.

Labels
needs: review 👀 This PR is complete and ready for reviewing
Development

Successfully merging this pull request may close these issues.

[ENH] - Add kubernetes horizontal autoscaler for conda-store workers based on queue depth
5 participants