
Edge Replication: connect: cannot assign requested address #24274

Closed

C-Aniruddh opened this issue Jun 6, 2023 · 1 comment

Comments

@C-Aniruddh
I have a database setup with 3 InfluxDB nodes in a primary/replica configuration. The primary node is where I send all reads/writes, and I maintain the replica nodes as backups. Within this configuration, I have four buckets on each node, so my total replication count is 4 (buckets) * 2 (replicas) = 8.

I have a writer that writes ~300,000 points/second across all the buckets to the master node (the points are of variable size, some a few KB each). When the master starts to replicate data to the replicas, I start seeing the following error:

ts=2023-06-06T22:39:47.378875Z lvl=error msg="Error in replication stream" log_id=0iGfpo8l000 service=replications replication_id=0b50a3b294085000 error="Post \"http://<ip-address>:8086/api/v2/write?bucket=f096d4afe945d373&org=af991c6c2990a8b9\": dial tcp <ip-address>:8086: connect: cannot assign requested address" retries=2495

This error generally starts showing up after around 20-30 seconds.

From what I know, this is happening because the master node opens too many TCP connections to the replicas and, in doing so, exhausts all available local ephemeral ports. A few fixes I have tried on my end are setting the following kernel parameters:

net.ipv4.tcp_tw_reuse=1
net.ipv4.tcp_max_tw_buckets=20000
net.ipv4.tcp_fastopen=3
echo '1024 65000' > /proc/sys/net/ipv4/ip_local_port_range

I have also increased the ulimits in my Docker Compose file:

ulimits:
  nofile:
    soft: 400000
    hard: 800000

With these fixes I still see the error after a while; then, whenever one of the TCP connections does become available again, the write goes through and the retry count resets back to 0.

I have a hunch that this can be resolved by tuning the HTTP client parameters, such as MaxIdleConnsPerHost, but the replication writeAPI client parameters cannot be modified from outside the codebase.
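To illustrate the kind of change I mean, here is a rough sketch using Go's standard net/http transport options; this is not the actual replication client code, and the specific values are placeholders:

package sketch

import (
	"net/http"
	"time"
)

// Rough sketch, not the actual InfluxDB replication code: an http.Client
// whose Transport keeps more idle connections per replica, so concurrent
// remote writes reuse keep-alive connections instead of dialing a new TCP
// connection (and burning an ephemeral port) for each request.
// All values below are illustrative placeholders.
func newPooledClient() *http.Client {
	transport := &http.Transport{
		MaxIdleConns:        100,              // idle connections kept across all replicas
		MaxIdleConnsPerHost: 16,               // the net/http default (http.DefaultMaxIdleConnsPerHost) is only 2
		IdleConnTimeout:     90 * time.Second, // keep idle connections around between write batches
	}
	return &http.Client{
		Transport: transport,
		Timeout:   30 * time.Second,
	}
}

With a transport along these lines, the replication streams to each replica would reuse a small set of keep-alive connections instead of consuming a fresh ephemeral port per write, which is what the sysctl changes above only partially work around.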

For further clarification on setup, all of the InfluxDB nodes are on separate VMs (but they are on the same local network, with < 0.5ms ping).

Steps to reproduce:
List the minimal actions needed to reproduce the behavior.

  1. Set up a master InfluxDB node and two replica nodes on 3 separate servers.
  2. Set up replications for the 4 buckets to both replicas on the master.
  3. Write a large amount of variable-sized data to all the buckets (a minimal writer sketch is below this list).
  4. Wait about a minute; the errors start appearing.
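For step 3, this is roughly the kind of writer I use; the URL, token, org, bucket and payload here are placeholders, and the real writer sends much larger, variable-sized batches across all four buckets:

package main

import (
	"context"
	"fmt"
	"time"

	influxdb2 "github.com/influxdata/influxdb-client-go/v2"
)

// Rough sketch of the writer from step 3. The URL, token, org, bucket name
// and payload are placeholders; the real writer sends much larger,
// variable-sized batches across all four buckets (~300k points/s total).
func main() {
	client := influxdb2.NewClient("http://<master-ip>:8086", "my-token")
	defer client.Close()

	writeAPI := client.WriteAPIBlocking("my-org", "bucket1")
	ctx := context.Background()

	for {
		// build a batch of line-protocol points and write them in one request
		batch := make([]string, 0, 5000)
		for i := 0; i < 5000; i++ {
			batch = append(batch, fmt.Sprintf("load,writer=w1 value=%d %d", i, time.Now().UnixNano()))
		}
		if err := writeAPI.WriteRecord(ctx, batch...); err != nil {
			fmt.Println("write error:", err)
		}
	}
}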

Expected behavior:
Data is replicated to the replicas over a shared connection pool, or the replication client is configured to reuse TCP connections.

Actual behavior:
A large number of short-lived TCP connections exhausts the local ephemeral port range, so new connections to the replicas fail with "cannot assign requested address".

Environment info:

  • System info: Linux 5.15.0-71-generic x86_64
  • InfluxDB version: InfluxDB v2.7.0 (git: 85f725f8b9) build_date: 2023-04-05T15:32:25Z
  • Other relevant environment details: Container runtime, disk info, etc

Config:
Copy any non-default config values here or attach the full config as a gist or file.

Logs:
Include snippet of errors in log.

Performance:
Generate profiles with the following commands for bugs related to performance, locking, out of memory (OOM), etc.

# Commands should be run when the bug is actively happening.
# Note: This command will run for ~30 seconds.
curl -o profiles.tar.gz "http://localhost:8086/debug/pprof/all?cpu=30s"
iostat -xd 1 30 > iostat.txt
# Attach the `profiles.tar.gz` and `iostat.txt` output files.
@C-Aniruddh (Author)

PR #23997 seems to resolve the issue.
