
Multiple parallel KRM function evaluator pods for the same KRM functions #336

Merged
efiacor merged 26 commits into kptdev:main from nokia:parallel-fn-pods on Mar 9, 2026
Conversation

@dgyorgy-nokia
Contributor

Title

Multiple parallel KRM function evaluator pods for the same KRM functions


Description

  • What changed:

    • Allow multiple pods to run for the same KRM function image when the load requires it.
    • Add two new command line parameters to function-runner: one that sets the maximum number of parallel pod executions allowed per KRM function type (maxParallelPodsPerFunction), and another that sets the maximum waitlist length of a pod (maxWaitlistLength).
    • By default the PodEvaluator still starts only 1 pod per KRM function type. If a new KRM function evaluation is requested while another is ongoing for the same KRM function, and the maximum allowed number of parallel pods has not been reached yet, a new pod is started.
    • The garbage collector collects idle pods, similarly to how it works today.
  • Why it’s needed:
    Currently the PodEvaluator, which manages KRM function evaluator pods and executes KRM functions in them, is designed to have 0 or 1 pod at any time for each KRM function image name. This limits the scalability of KRM function evaluation, so this change relaxes that limitation by allowing multiple pods to run at the same time for the same KRM function image name.

  • How it works (see the sketch after this list):

    • There is a separate waitlist per running pod of each KRM function.
    • KRM function evaluation requests are sent to a single pod one-by-one, not in parallel.
    • When a new function evaluation request is received, it picks the shortest waitlist for the given KRM function. If there are multiple shortest waitlists, it picks the one with the lowest index in the slice of waitlists.
    • When all waitlists of a KRM function are longer than a configured value (the maxWaitlistLength parameter), and the maximum number of allowed parallel pods (maxParallelPodsPerFunction) per KRM function has not been reached yet, a new pod is started and a waitlist is created for it.
    • When the waitlist of a pod is empty for more than the given TTL period, the garbage collector deletes that pod and its waitlist.
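
A minimal Go sketch of the selection rule above. This is illustrative only: the request type, the per-function slice of waitlists, and the function name are assumptions, not the PR's actual code.

// request stands in for a queued KRM function evaluation request.
type request struct{}

// shortestWaitlist returns the index of the shortest waitlist for a function.
// Ties are broken by keeping the lowest index, matching the rule above.
func shortestWaitlist(waitlists [][]request) int {
    best := -1
    for i, wl := range waitlists {
        if best == -1 || len(wl) < len(waitlists[best]) {
            best = i
        }
    }
    return best
}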

Type of Change

  • Bug fix
  • New feature
  • Enhancement
  • Refactor
  • Documentation
  • Tests
  • Other: ________

Checklist

  • Code follows project style guidelines
  • Self-reviewed changes
  • Tests added/updated
  • Documentation added/updated
  • All tests and gating checks pass

@nephio-prow
Contributor

nephio-prow Bot commented Oct 15, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: dgyorgy-nokia
Once this PR has been reviewed and has the lgtm label, please assign efiacor for approval by writing /assign @efiacor in a comment. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@sonarqubecloud

Quality Gate failed

Failed conditions
58.3% Coverage on New Code (required ≥ 80%)

See analysis details on SonarQube Cloud

@dgyorgy-nokia dgyorgy-nokia marked this pull request as draft October 15, 2025 17:55
@sonarqubecloud

Quality Gate failed

Failed conditions
58.3% Coverage on New Code (required ≥ 80%)

See analysis details on SonarQube Cloud

@liamfallon
Collaborator

I think a PR was merged recently that ignored that linting error. If you rebase this PR, it may remove the linting error.

@thc1006
Contributor

thc1006 commented Jan 22, 2026

Dear @dgyorgy-nokia,

I've been looking into this PR as I'm also interested in the parallel pod feature for our use case. While reviewing the code, I came across something that might cause issues with multiple pods.

In func/internal/podmanager.go around line 790, the endpoint validation only checks the first address:

endpointIP := endpoint.Subsets[0].Addresses[0].IP

When a second pod joins the same Service, its IP would appear at a different index in the Addresses slice, so this check would fail for any pod that isn't first in the list.

I think changing it to search through all addresses would fix this:

found := false
for _, addr := range endpoint.Subsets[0].Addresses {
    if addr.IP == podIP {
        found = true
        break
    }
}
if !found {
    return false, fmt.Errorf("pod IP %s not found in service endpoints", podIP)
}

By the way, the same pattern exists in main branch (podevaluator.go:1257), though it doesn't cause problems there since only one pod per function is currently supported.

Let me know if I'm missing something or if you'd like me to help with a fix.

@dgyorgy-nokia
Contributor Author

(Quoting @thc1006's comment above about the endpoint IP check in podmanager.go.)

Dear @thc1006 ,

Thanks for the feedback!

The idea here is to have service-pod pairs, which means that one service should have only one pod.

The function-runner takes care of distributing the requests across the KRM function pods. Two new parameters have been introduced for this (MaxWaitlistLength / MaxParallelPodsPerFunction).

There is a waitlist for each started service-pod pair. If this waitlist reaches the value of the MaxWaitlistLength parameter and the configured MaxParallelPodsPerFunction is larger than the actual number of pods, a new pod-service pair is created. With multiple running pods, an incoming request is handled by the pod with the shortest waitlist.

It can also be configured to behave the same way as it does now: if MaxParallelPodsPerFunction is set to one, only one pod handles the requests. In that case the MaxWaitlistLength setting is ignored and requests are routed to that pod even if the limit is exceeded. (See the sketch below.)

There are a few more changes I need to make. I'll update this PR as soon as I can.
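
For illustration, a rough Go sketch of the route-or-spawn decision just described, reusing the illustrative shortestWaitlist helper and request type from the sketch in the PR description; the actual names and structure in the PR may differ.

// routeOrSpawn decides whether an incoming request joins an existing pod's
// waitlist (podIdx >= 0) or a new pod-service pair should be created.
func routeOrSpawn(waitlists [][]request, maxParallelPodsPerFunction, maxWaitlistLength int) (podIdx int, spawnNew bool) {
    if len(waitlists) == 0 {
        return -1, true // no pod yet for this function: create the first one
    }
    // With a single allowed pod, behave as before: everything is routed to
    // pod 0 and maxWaitlistLength is ignored.
    if maxParallelPodsPerFunction <= 1 {
        return 0, false
    }
    idx := shortestWaitlist(waitlists)
    // Create a new pod-service pair only when every waitlist has reached the
    // limit and the per-function pod cap has not been hit yet.
    if len(waitlists[idx]) >= maxWaitlistLength && len(waitlists) < maxParallelPodsPerFunction {
        return -1, true
    }
    return idx, false
}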

@netlify

netlify Bot commented Feb 12, 2026

Deploy Preview for porch ready!

Name: porch
🔨 Latest commit: 264ec44
🔍 Latest deploy log: https://app.netlify.com/projects/porch/deploys/69ac9079a4c91100080cfec8
😎 Deploy Preview: https://deploy-preview-336--porch.netlify.app

@efiacor
Collaborator

efiacor commented Feb 23, 2026

@dgyorgy-nokia
Contributor Author

(Quoting @thc1006:)

Hi @dgyorgy-nokia,

Thanks for adding the podcachemanager tests! I noticed the podmanager_unit_test.go from the gist didn't make it in, though. Was that intentional or just missed?

Also, I have some more tests ready covering EvaluateFunction paths and event loop edge cases that should help push the coverage closer to 80%. Let me know if you want me to share them and I'll put up another gist.

Thanks!

Hi @thc1006 ,
Sorry, I just missed the second test file, I’ve already pushed that one as well.
If there are further tests, please feel free to share them.
Thanks!

@thc1006
Contributor

thc1006 commented Feb 23, 2026

Hi @dgyorgy-nokia,

Thanks for pushing the second file! Here's batch 2: https://gist.github.com/thc1006/3c0ab87ea258dbe1d1406daf2cd25c85

Two files:

  • podevaluator_unit_test.go: 6 tests covering EvaluateFunction paths (success, grpc errors, context cancel, etc.)
  • podcachemanager_eventloop_test.go: 10 tests covering the event loop, warmupCache, and retrieveFunctionPods edge cases

One small note: I noticed podCacheManager now takes a context.Context, so I updated the goroutine calls from go pcm.podCacheManager() to go pcm.podCacheManager(t.Context()). I verified everything passes against the current tip of your branch.

Hope these help push coverage over 80%!

@thc1006
Contributor

thc1006 commented Feb 23, 2026

Re efiacor's docs comment: I looked at both pages, and the config page is missing --max-parallel-pods-per-function and --max-waitlist-length entirely. Here's what I'd add to the Pod Runtime Arguments section:

- --max-parallel-pods-per-function=1  # max pods per function image (default: 1)
- --max-waitlist-length=2             # concurrent evaluations per pod before spawning a new pod (default: 2)

One thing worth noting in the description: --max-waitlist-length counts both queued and actively running evaluations (execution inside each pod is serialized via fnEvaluationMutex), so it's a load-pressure threshold rather than a strict queue cap. It also only takes effect when --max-parallel-pods-per-function > 1. The same two flags should go into the Complete Example block as well.

Let me know if a full diff would be easier and I'll put one together.
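
For the Complete Example mentioned above, a sketch of how the two flags could appear among the function-runner container args in the docs' deployment excerpt. Only these two args are shown; the rest of the real example is omitted here and the chosen values are illustrative.

args:
  - --max-parallel-pods-per-function=3   # allow up to 3 evaluator pods per KRM function image
  - --max-waitlist-length=2              # create a new pod once every existing waitlist holds 2 evaluations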

@liamfallon
Collaborator

Quality Gate passed

Issues: 16 new issues, 0 accepted issues

Measures: 0 security hotspots, 81.2% coverage on new code, 0.0% duplication on new code

See analysis details on SonarQube Cloud

Nice!

@thc1006
Contributor

thc1006 commented Feb 23, 2026

Congrats!

@nagygergo
Contributor

nagygergo commented Feb 25, 2026

@dgyorgy-nokia, in the default k8s manifests function-runner runs with 2 replicas, but as I understand it, in this PR load balancing only happens inside the process. Would that mean overloading the KRM function executors compared to the provided cache configuration?
I think that function-runner should be scaled back to a single replica.

@dgyorgy-nokia
Contributor Author

(Quoting @nagygergo's comment above about the function-runner replica count.)

Yes, that’s correct. The actual load balancing happens in the PodCacheManager, based on the provided cache configuration, not at the Kubernetes level. So it indeed makes sense to reduce the function-runner replica count to one.
If that’s fine, I’ll modify the relevant part accordingly.
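
For illustration, the change discussed here would amount to something like the following excerpt in the default function-runner Deployment manifest (a sketch; the manifest path and surrounding fields are not shown in this thread).

spec:
  replicas: 1   # load balancing happens in-process in the PodCacheManager, so one replica is enough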

@mozesl-nokia mentioned this pull request Mar 5, 2026
@liamfallon
Collaborator

@dgyorgy-nokia would you mind rebasing this PR please?

@efiacor
Collaborator

efiacor commented Mar 6, 2026

Is this PR ready for final review/merge @dgyorgy-nokia @thc1006 ?
Can we do a final rebase?

@thc1006
Contributor

thc1006 commented Mar 6, 2026

Dear @dgyorgy-nokia,
I am OK with it. You make the final decision. (Thanks, everyone!)

@dosubot added the size:XXL label (This PR changes 1000+ lines, ignoring generated files.) Mar 6, 2026
@dgyorgy-nokia
Contributor Author

(Quoting @efiacor's question above about the final review and rebase.)

Sure, it's ready.

Comment thread on go.mod (Outdated)

go 1.25.7

retract v1.3.0
Collaborator

Can we remove this retract? Don't think it's needed.

@sonarqubecloud

sonarqubecloud Bot commented Mar 7, 2026

Collaborator

@efiacor left a comment


Nice work guys. Thanks. I suspect we might need some additional updates to the docs as a follow-up. Somewhere here maybe - https://docs.porch.nephio.org/docs/5_architecture_and_components/function-runner/

@efiacor efiacor merged commit 47eb46f into kptdev:main Mar 9, 2026
22 checks passed
@dosubot

dosubot Bot commented Mar 9, 2026

Documentation Updates

2 document(s) were updated by changes in this PR:

_index
View Changes
@@ -25,13 +25,14 @@
 ### Pod Lifecycle Management
 
 Manages function execution pods with caching and garbage collection:
-- **Pod Cache Manager**: Orchestrates pod lifecycle via channel-based communication
+- **Pod Cache Manager**: Orchestrates a pool of pods for each KRM function type via channel-based communication, supporting configurable parallel pod execution and per-pod request waitlists
 - **Pod Manager**: Handles pod and service CRUD operations
 - **Pod Creation**: Template-based pod creation with init container for wrapper server injection
 - **Service Management**: ClusterIP service frontends for service mesh compatibility
-- **TTL-Based Caching**: Reuses pods with configurable expiration and extension on use
+- **TTL-Based Caching**: Reuses pods with configurable expiration per pod; each pod in the pool has its own TTL
 - **Garbage Collection**: Periodic cleanup of expired pods and failed pod handling
 - **Pod Warming**: Pre-creates pods for frequently-used functions
+- **Horizontal Scaling**: Supports multiple parallel pods per function type (maxParallelPodsPerFunction) with configurable waitlist limits (maxWaitlistLength), improving throughput under high load while maintaining resource efficiency through TTL-based cleanup
 
 For detailed architecture and process flows, see [Pod Lifecycle Management]({{% relref "/docs/5_architecture_and_components/function-runner/functionality/pod-lifecycle-management.md" %}}).
 
@@ -86,13 +87,15 @@
 1. **Function Evaluation** receives gRPC request from Task Handler
 2. **Multi-Evaluator** tries executable evaluator first (fast path)
 3. **If NotFound**, falls back to pod evaluator (container execution)
-4. **Pod Lifecycle Management** checks pod cache for existing pod
-5. **If cache miss**, creates new pod with wrapper server via Pod Manager
-6. **Image & Registry Management** resolves image metadata and authentication
-7. **Pod Manager** creates pod with image pull secrets and service frontend
-8. **Pod Cache Manager** stores pod with TTL for reuse
-9. **Function Evaluation** connects to pod via service and executes function
-10. **Wrapper Server** executes function binary and returns structured results
-11. **Garbage Collection** periodically removes expired pods from cache
+4. **Pod Lifecycle Management** checks pod cache for available pods of the requested function type
+5. **If available pod found**, request is assigned to the pod with the shortest waitlist
+6. **If all pods busy but waitlists have capacity**, request is queued in the shortest waitlist
+7. **If all pods and waitlists full**, creates new pod (up to maxParallelPodsPerFunction limit) via Pod Manager
+8. **Image & Registry Management** resolves image metadata and authentication
+9. **Pod Manager** creates pod with image pull secrets and service frontend
+10. **Pod Cache Manager** stores pod in function's pool with individual TTL for reuse
+11. **Function Evaluation** connects to pod via service and executes function
+12. **Wrapper Server** executes function binary and returns structured results
+13. **Garbage Collection** periodically removes expired pods from each function's pool
 
 Each functional area is documented in detail on its own page with architecture diagrams, process flows, and implementation specifics.
function-evaluation
View Changes
@@ -8,7 +8,7 @@
 
 ## Overview
 
-Function evaluation is the core responsibility of the Function Runner - executing KRM (Kubernetes Resource Model) functions through pluggable evaluator strategies. The system uses a strategy pattern where different evaluators handle function execution in different ways (pod-based, executable, or chained), all conforming to a common interface.
+Function evaluation is the core responsibility of the Function Runner - executing KRM (Kubernetes Resource Model) functions through pluggable evaluator strategies. The system uses a strategy pattern where different evaluators handle function execution in different ways (pod-based, executable, or chained), all conforming to a common interface. The pod-based evaluator supports horizontal scaling through multiple parallel pods per function type, enabling high-throughput execution for demanding workloads.
 
 ### High-Level Architecture
 
@@ -133,18 +133,24 @@
 
 ### Waitlist Mechanism
 
-Prevents duplicate pod creation when multiple requests arrive for the same function:
-
-**Waitlist pattern:**
-- Multiple requests for same image queue up
-- Single pod creation serves all waiters
-- Batch notification when pod ready
-- Prevents duplicate pod creation
+Each pod maintains its own waitlist to queue requests and enable efficient load distribution:
+
+**Per-pod waitlist pattern:**
+- Each pod instance has a separate waitlist with configurable maximum length
+- Incoming requests are assigned to the pod with the shortest waitlist
+- Multiple requests for the same image can execute across different pods in parallel
+- When all pods are busy and all waitlists reach maximum length, a new pod is created (up to `maxParallelPodsPerFunction`)
+
+**Load distribution:**
+- Requests distributed to least-loaded pod (shortest waitlist)
+- Ties broken by selecting lowest pod index
+- Enables horizontal scaling when demand increases
+- Automatic scale-down through garbage collection of idle pods
 
 **Error handling:**
-- Pod creation errors sent to all waiters
-- Waitlist cleared on error
-- Each waiter receives error independently
+- Pod creation errors sent to all waiters in that pod's waitlist
+- Failed pod removed from pool and waitlist redistributed to other pods
+- Each waiter receives error independently if redistribution fails
 - Allows retry on next request
 
 ### Function Execution
@@ -417,6 +423,18 @@
 - Garbage collection removes expired pods
 - Failed pods immediately deleted
 
+**Parallel execution configuration:**
+- `maxParallelPodsPerFunction`: Controls how many pods can run simultaneously for each function type (default: 1)
+- `maxWaitlistLength`: Controls how many requests can queue per pod before triggering new pod creation (default: 2)
+- Configurable globally via command-line arguments or per-function via `pod-cache-config.yaml`
+- Per-function configuration overrides global defaults
+
+**Scaling behavior:**
+- New pod created when all existing pods' waitlists reach `maxWaitlistLength`
+- Maximum pods per function limited by `maxParallelPodsPerFunction`
+- Idle pods beyond TTL are garbage collected
+- Enables automatic scaling based on load
+
 ### Cache Warming
 
 **Warming strategy:**
@@ -437,13 +455,19 @@
 - Multiple requests can execute concurrently
 - Each request gets own gRPC connection
 - Pod cache manager coordinates access
-- Waitlist prevents duplicate pod creation
+- Per-pod waitlists manage queuing and load distribution
 
 **Concurrency characteristics:**
-- Same function, same pod: Sequential (one at a time)
-- Same function, different pods: Concurrent
+- Same function, same pod: Sequential (one at a time per pod)
+- Same function, multiple pods: Parallel execution across pods (up to `maxParallelPodsPerFunction`)
 - Different functions: Fully concurrent
-- No artificial concurrency limits
+- No artificial concurrency limits beyond configured maximums
+
+**Parallel pod scaling:**
+- Multiple pods can run simultaneously for the same function type
+- Each pod maintains its own waitlist of queued requests
+- New pods created when all waitlists reach `maxWaitlistLength` (up to `maxParallelPodsPerFunction`)
+- Enables high-throughput execution for demanding workloads
 
 ### Resource Limits
 
@@ -458,3 +482,5 @@
 - Configure cache warming for hot functions
 - Use executable evaluator for critical path
 - Monitor pod resource usage
+- Tune `maxParallelPodsPerFunction` for high-load scenarios
+- Adjust `maxWaitlistLength` to balance responsiveness vs pod overhead



Labels

size:XXL This PR changes 1000+ lines, ignoring generated files.
