Bug: Resource Limit Exceeded When Opening Outbound Stream
Problem Description
When attempting to open an outbound stream to a remote worker, we're encountering a resource limit error. This is preventing successful communication with worker nodes and impacting the network's ability to distribute tasks.
This error was observed on our gold-1 node in AWS and was resolved with a node restart.
Error message:
error sending work to worker: : error opening stream: failed to open stream: stream-57946: transient: cannot reserve outbound stream: resource limit exceeded
Logs:
time="2024-08-24T19:24:14Z" level=info msg="WorkerType is related to Twitter"
time="2024-08-24T19:24:14Z" level=info msg="Checking connections to eligible workers"
time="2024-08-24T19:24:14Z" level=info msg="Worker selection took 0 milliseconds"
time="2024-08-24T19:24:14Z" level=info msg="Starting round-robin worker selection"
time="2024-08-24T19:24:14Z" level=info msg="Attempting remote worker 16Uiu2HAm9GNsYsuXkLGM8r7bzoh2nFxkocvuUi9oFeaMUqdVPTB4 (attempt 1/10)"
time="2024-08-24T19:24:14Z" level=error msg="error sending work to worker: : error opening stream: failed to open stream: stream-57946: transient: cannot reserve outbound stream: resource limit exceeded"
time="2024-08-24T19:24:14Z" level=info msg="Remote worker 16Uiu2HAm9GNsYsuXkLGM8r7bzoh2nFxkocvuUi9oFeaMUqdVPTB4 failed, moving to next worker"
time="2024-08-24T19:24:32Z" level=info msg="Node left: /ip4/47.157.92.220/udp/1028/quic-v1/p2p/16Uiu2HAm9GNsYsuXkLGM8r7bzoh2nFxkocvuUi9oFeaMUqdVPTB4"
time="2024-08-24T19:24:32Z" level=info msg="[+] Staked node joined: /ip4/47.157.92.220/udp/4001/quic-v1/p2p/16Uiu2HAm9GNsYsuXkLGM8r7bzoh2nFxkocvuUi9oFeaMUqdVPTB4"
Current Resource Management
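Our Oracle Node currently uses libp2p's default resource management configuration with auto-scaling. The implementation is in pkg/oracle_node.go (line 111): https://github.com/masa-finance/masa-oracle/blob/main/pkg/oracle_node.go
It relies on limit.go and limit_defaults.go from the github.com/libp2p/go-libp2p/p2p/host/resource-manager package.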
From limit_defaults.go:
scalingLimits := rcmgr.DefaultLimits
This line uses the DefaultLimits variable defined in limit_defaults.go. It is a ScalingLimitConfig that provides default values for various resource limits.
Also from limit_defaults.go:
concreteLimits := scalingLimits.AutoScale()
This calls the AutoScale() method of ScalingLimitConfig, which in turn calls the Scale() method with automatically determined memory and file descriptor values. The Scale() method applies the scaling logic to create a ConcreteLimitConfig.
From limit.go:
limiter := rcmgr.NewFixedLimiter(concreteLimits)
This uses the NewFixedLimiter function defined in limit.go to create a new fixedLimiter with the concrete limits.
The limiter is then passed to rcmgr.NewResourceManager, which creates a new ResourceManager using the fixedLimiter we just created.
The current configuration is using the default scaling limits provided by libp2p, which are then auto-scaled based on the system's available resources.
DefaultLimits provides a base configuration for various resource limits (connections, streams, memory, etc.) and how they should scale with available resources.
AutoScale() determines the available system resources (memory and file descriptors) and scales the limits accordingly.
The resulting ConcreteLimitConfig is used to create a fixedLimiter, which enforces these limits.
The ResourceManager is created with this limiter, which will then enforce these limits throughout the libp2p node's lifecycle.
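The four steps above can be put together in one place. The following is a minimal sketch, not the code from oracle_node.go: it assumes the resource manager is handed to the host through libp2p.ResourceManager, and the variable names are illustrative.

```go
package main

import (
	"fmt"

	"github.com/libp2p/go-libp2p"
	rcmgr "github.com/libp2p/go-libp2p/p2p/host/resource-manager"
)

func main() {
	// Start from libp2p's default scaling configuration.
	scalingLimits := rcmgr.DefaultLimits

	// Scale the defaults to this machine's memory and file descriptors.
	concreteLimits := scalingLimits.AutoScale()

	// Wrap the concrete limits in a limiter that does not change at runtime.
	limiter := rcmgr.NewFixedLimiter(concreteLimits)

	// Build the resource manager that enforces the limits.
	rm, err := rcmgr.NewResourceManager(limiter)
	if err != nil {
		panic(err)
	}

	// Assumption: the host is created with this resource manager attached.
	host, err := libp2p.New(libp2p.ResourceManager(rm))
	if err != nil {
		panic(err)
	}
	defer host.Close()

	fmt.Println("host started with default auto-scaled limits:", host.ID())
}
```

Because the limits are computed once at startup, resizing the machine only takes effect after the node restarts and AutoScale() runs again.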
We might want to:
Customize the DefaultLimits before calling AutoScale().
Implement our own scaling logic instead of using AutoScale().
Use a different type of limiter (e.g., a scaling limiter) instead of a fixed one.
Directly configure a ConcreteLimitConfig with values tailored to our application's needs.
For example, if we want to increase the number of inbound connections allowed, we could do something like:
scalingLimits := rcmgr.DefaultLimits
scalingLimits.SystemBaseLimit.ConnsInbound = 128 // Double the default
scalingLimits.SystemLimitIncrease.ConnsInbound = 128 // Double the default increase
concreteLimits := scalingLimits.AutoScale()
This would allow for more inbound connections in both the base case and as the system scales up.
The current setup allows for some automatic scaling of resources based on the system's capabilities, but it doesn't include any custom limits or fine-tuning for our specific needs.
Impact
Failure to open streams to remote workers
Potential bottleneck in task distribution
Reduced network efficiency and responsiveness
Suggested Solutions
Custom Resource Limits:
Implement custom resource limits tailored to our network's needs.
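The error in this issue names the transient scope and outbound streams ("transient: cannot reserve outbound stream"), so a first attempt at custom limits might target those fields rather than inbound connections. The sketch below is one possible shape for that; the function name newTunedResourceManager and the factor of 2 are assumptions chosen for illustration, not measured values.

```go
package main

import (
	"fmt"

	"github.com/libp2p/go-libp2p/core/network"
	rcmgr "github.com/libp2p/go-libp2p/p2p/host/resource-manager"
)

// newTunedResourceManager (hypothetical helper) raises the transient-scope
// stream limits before auto-scaling and returns the resulting manager.
func newTunedResourceManager() (network.ResourceManager, error) {
	scalingLimits := rcmgr.DefaultLimits

	// Assumption: doubling is only a starting point for experiments.
	// The total stream limit is raised too, so it does not cap the
	// outbound limit.
	scalingLimits.TransientBaseLimit.StreamsOutbound *= 2
	scalingLimits.TransientBaseLimit.Streams *= 2
	scalingLimits.TransientLimitIncrease.StreamsOutbound *= 2
	scalingLimits.TransientLimitIncrease.Streams *= 2

	concreteLimits := scalingLimits.AutoScale()
	return rcmgr.NewResourceManager(rcmgr.NewFixedLimiter(concreteLimits))
}

func main() {
	rm, err := newTunedResourceManager()
	if err != nil {
		panic(err)
	}
	defer rm.Close()
	fmt.Println("resource manager created with raised transient stream limits")
}
```

Since a node restart cleared the error, it is also worth checking whether streams are being left open or unreset somewhere. If they are, higher limits would only delay the failure.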
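Dynamic Limiter:
Replace the fixed limiter with a scaling limiter for more adaptive resource allocation.
Resource Usage Monitoring: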
Implement a monitoring system to track resource usage and adjust limits dynamically.
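As a starting point for monitoring, the resource manager can be asked for the current usage of a scope. This sketch logs the transient scope at a fixed interval; logTransientUsage, the one-second interval, and the three-second demo run are assumptions for illustration, and a real setup would more likely export these numbers to a metrics system.

```go
package main

import (
	"context"
	"log"
	"time"

	"github.com/libp2p/go-libp2p/core/network"
	rcmgr "github.com/libp2p/go-libp2p/p2p/host/resource-manager"
)

// logTransientUsage (hypothetical helper) logs transient-scope usage
// every interval until ctx is cancelled.
func logTransientUsage(ctx context.Context, rm network.ResourceManager, interval time.Duration) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			err := rm.ViewTransient(func(s network.ResourceScope) error {
				st := s.Stat()
				log.Printf("transient scope: streams out=%d in=%d, conns out=%d in=%d, memory=%d",
					st.NumStreamsOutbound, st.NumStreamsInbound,
					st.NumConnsOutbound, st.NumConnsInbound, st.Memory)
				return nil
			})
			if err != nil {
				log.Printf("reading transient scope failed: %v", err)
			}
		}
	}
}

func main() {
	rm, err := rcmgr.NewResourceManager(rcmgr.NewFixedLimiter(rcmgr.DefaultLimits.AutoScale()))
	if err != nil {
		panic(err)
	}
	defer rm.Close()

	// Run the logger for a few seconds as a demonstration.
	ctx, cancel := context.WithTimeout(context.Background(), 3*time.Second)
	defer cancel()
	logTransientUsage(ctx, rm, time.Second)
}
```

A steadily growing outbound stream count that never drops would point to leaked streams rather than limits that are simply too low.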
Increase Cloud Infrastructure Limits:
Review and potentially increase the limits on our cloud infrastructure, particularly for network-related resources.
Connection Management:
Implement more aggressive connection management, including closing idle connections and limiting the maximum number of concurrent connections.
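One way to cap concurrent connections is libp2p's basic connection manager. The sketch below is an assumption about how it could be wired in, with illustrative watermarks (100 and 400) and grace period (one minute). Note that this manager trims connections once the count exceeds the high watermark; it does not close connections specifically because they are idle.

```go
package main

import (
	"fmt"
	"time"

	"github.com/libp2p/go-libp2p"
	"github.com/libp2p/go-libp2p/p2p/net/connmgr"
)

func main() {
	// When the connection count exceeds the high watermark (400), the
	// manager trims connections until the low watermark (100) is reached.
	// Connections younger than the grace period are not trimmed.
	cm, err := connmgr.NewConnManager(
		100, // low watermark (assumed value)
		400, // high watermark (assumed value)
		connmgr.WithGracePeriod(time.Minute),
	)
	if err != nil {
		panic(err)
	}

	host, err := libp2p.New(libp2p.ConnectionManager(cm))
	if err != nil {
		panic(err)
	}
	defer host.Close()

	fmt.Println("host started with connection manager:", host.ID())
}
```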
Backoff and Retry Mechanism:
Implement a backoff and retry mechanism when encountering resource limit errors to prevent immediate resource exhaustion.
Next Steps
Analyze current resource usage patterns
Implement and test custom resource limits
Set up monitoring for resource usage
Review and adjust cloud infrastructure limits
Implement connection management improvements
Develop and integrate a backoff and retry mechanism
teslashibe changed the title from "bug:" to "Bug: Resource Limit Exceeded When Opening Outbound Stream" on Aug 24, 2024
@restevens402 @mudler @Luka-Loncar add this to no-status. Found a bug here. The default action should be to increase the CPU and RAM on the AWS node before software optimizations.
@5u6r054 to assist on increasing machine specs on Monday 👍