dns.lookup: pending promises grow unboundedly under sustained EAI_AGAIN load #62503

Description

@orgads

Version

v24.14.1

Platform

Linux 6.6.87.2-microsoft-standard-WSL2 x86_64 (Debian forky/sid)

Subsystem

dns

What steps will reproduce the bug?

When dns.promises.lookup() is called repeatedly for a hostname that triggers EAI_AGAIN (e.g. a nonsense TLD like 'asd'), getaddrinfo blocks its libuv thread for ~10 seconds before rejecting. Under sustained load, new lookups queue behind the blocked threads and the number of unresolved promises grows without bound.

In our production system this manifests as a memory leak: each pending promise holds references to the calling closure (in our case, per-conversation config objects), and none of them can be GC'd until getaddrinfo eventually returns.
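The retention pattern looks roughly like this (a sketch only; `perRequestConfig` and the 10 MB buffer are illustrative stand-ins for our per-conversation config objects):

```javascript
import { lookup } from 'node:dns/promises';

// Sketch of the retention pattern: the rejection handler closes over
// perRequestConfig, so it stays live for as long as the lookup is
// pending -- ~10s per EAI_AGAIN attempt, or indefinitely while the
// request is queued behind blocked threadpool threads.
function startLookup(hostname) {
  const perRequestConfig = { payload: Buffer.alloc(10 * 1024 * 1024) }; // 10 MB
  return lookup(hostname).then(
    (result) => result.address,
    (err) => {
      // References perRequestConfig; nothing here can be GC'd until
      // getaddrinfo returns and the promise settles.
      console.log(err.code, perRequestConfig.payload.length);
    }
  );
}
```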

Minimal reproduction — save as repro.mjs:

import { lookup } from 'node:dns/promises';

const HOSTNAME = 'asd'; // triggers EAI_AGAIN
const INTERVAL_MS = 2000;
const TOTAL = 50;

let pending = 0;
let started = 0;
let settled = 0;

function fire() {
  const id = ++started;
  ++pending;
  const t0 = Date.now();
  console.log(`[#${id}] started  (pending: ${pending})`);

  lookup(HOSTNAME).then(
    (result) => {
      --pending;
      ++settled;
      console.log(`[#${id}] resolved ${result.address} after ${Date.now() - t0}ms  (pending: ${pending})`);
    },
    (err) => {
      --pending;
      ++settled;
      console.log(`[#${id}] rejected ${err.code} after ${Date.now() - t0}ms  (pending: ${pending})`);
    }
  );
}

const interval = setInterval(() => {
  if (started >= TOTAL) {
    clearInterval(interval);
    return;
  }
  fire();
}, INTERVAL_MS);

setInterval(() => {
  console.log(`\n--- status: started=${started} settled=${settled} pending=${pending} ---\n`);
}, 5000).unref();

setTimeout(() => {
  console.log(`\n=== FINAL: started=${started} settled=${settled} stuck=${pending} ===`);
  if (pending > 0)
    console.log(`LEAK CONFIRMED: ${pending} lookup(s) never settled.`);
  else
    console.log('No leak detected.');
  process.exit(pending > 0 ? 1 : 0);
}, TOTAL * INTERVAL_MS + 60_000);

Run:

node repro.mjs

How often does it reproduce? Is there a required condition?

Always

What is the expected behavior? Why is that the expected behavior?

Each lookup() call should resolve or reject in bounded time, regardless of the hostname. Under sustained load, the number of pending (unresolved) promises should stay bounded — ideally proportional to the libuv thread pool size (UV_THREADPOOL_SIZE, default 4), not to the total number of requests issued.

What do you see instead?

  • Each getaddrinfo call for 'asd' blocks a libuv thread for ~10 seconds before rejecting with EAI_AGAIN.
  • Only ~2 rejections return per 10-second cycle (thread pool saturation).
  • New lookups queue behind the blocked threads, so the pending count grows monotonically: 5 → 8 → 11 → 14 → 17 → 20 → …
  • Promises are never settled until their turn comes, which under continuous load may be never.

Sample output (trimmed):

[#1] started  (pending: 1)
[#2] started  (pending: 2)
[#3] started  (pending: 3)
[#4] started  (pending: 4)
[#5] started  (pending: 5)
[#6] started  (pending: 6)
[#1] rejected EAI_AGAIN after 10031ms  (pending: 5)
[#7] started  (pending: 6)
[#2] rejected EAI_AGAIN after 10014ms  (pending: 5)

--- status: started=7 settled=2 pending=5 ---

[#8] started  (pending: 6)
...
[#12] started  (pending: 9)
[#4] rejected EAI_AGAIN after 16024ms  (pending: 8)

--- status: started=12 settled=4 pending=8 ---

...
[#32] started  (pending: 21)
[#12] rejected EAI_AGAIN after 40068ms  (pending: 20)

--- status: started=32 settled=12 pending=20 ---

The pending count never converges to zero. In a long-running server, this is effectively a memory leak because every queued promise retains references to its enclosing closure.
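The growth rate is consistent with simple queueing arithmetic (a back-of-envelope model, not a measurement; the effective concurrency of ~2 is inferred from the sample output above, even though UV_THREADPOOL_SIZE nominally allows 4):

```javascript
// Back-of-envelope model: lookups arrive once per intervalMs; each
// EAI_AGAIN lookup occupies a thread for ~blockMs, so at most
// effectiveThreads lookups settle per blockMs window. A positive
// result means the pending count grows without bound.
function backlogGrowthPerSecond({ intervalMs, blockMs, effectiveThreads }) {
  const arrivalRate = 1000 / intervalMs;                   // lookups started per second
  const serviceRate = (effectiveThreads * 1000) / blockMs; // lookups settled per second
  return arrivalRate - serviceRate;
}

// One lookup every 2s, ~10s per rejection, ~2 concurrent rejections:
const growth = backlogGrowthPerSecond({ intervalMs: 2000, blockMs: 10000, effectiveThreads: 2 });
console.log(growth.toFixed(2)); // 0.30 pending/s, i.e. ~3 more pending per
                                // 10s cycle, matching 5 -> 8 -> 11 above
```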

Additional information

  • UV_THREADPOOL_SIZE is unset (default 4).
  • The callback-based dns.lookup() shows the same behavior — this is a libuv/getaddrinfo issue, not specific to the promises API.
  • Hostnames that fail fast (e.g. ENOTFOUND) do not exhibit this problem.
  • The EAI_AGAIN timeout (~10s per attempt) appears to come from the system resolver's retry/timeout settings, but the queuing behind the fixed-size thread pool is what makes it unbounded.
  • A possible mitigation from Node's side: support AbortSignal in dns.lookup() / dns.promises.lookup() so callers can cancel stale lookups, or expose a configurable per-lookup timeout.
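Until something like that lands, a userland workaround can at least cap how long callers wait (a sketch; `withTimeout` is an illustrative helper, not a proposed API). This does NOT cancel the underlying getaddrinfo call -- the libuv thread stays blocked -- but the caller's promise settles, so its closure can be GC'd:

```javascript
import { lookup } from 'node:dns/promises';

// Reject if `promise` has not settled within `ms`. The losing promise
// keeps running; this only releases the caller's side.
function withTimeout(promise, ms, label = 'operation') {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => {
      const err = new Error(`${label} timed out after ${ms}ms`);
      err.code = 'ETIMEDOUT';
      reject(err);
    }, ms);
  });
  // Whichever settles first wins; always clear the timer so it cannot
  // keep the process alive.
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// A fast lookup is unaffected; an EAI_AGAIN lookup would reject after
// `ms` instead of ~10s (or longer, once queuing sets in).
withTimeout(lookup('localhost'), 5000, "lookup 'localhost'")
  .then((r) => console.log(r.address))
  .catch((err) => console.log(err.code));
```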
