Skip to content

test_safekeeper_without_pageserver fails sporadically due to slow fsync on specific runners #11627

@alexanderlaw

Description

@alexanderlaw

Multiple failures of test_safekeeper_without_pageserver (23 during the past 30 days) occurred because of errors like:

E   Exception: Failed to run subprocess as ['/tmp/neon/bin/safekeeper', '-l', '127.0.0.1:18750', '--listen-http', '127.0.0.1:18752', '-D', '/tmp/test_output/test_safekeeper_without_pageserver[release-pg14]-2/repo/sk1', '--id', '1', '--broker-endpoint', 'http://127.0.0.1:18749'], \
 reason: Failed to get correct status from subprocess

are caused by slow fsync on specific runner(s).

I've analyzed several ones out of 23 (all occurred on X64):

https://neon-github-public-dev.s3.amazonaws.com/reports/pr-11289/13949906475/index.html#testresult/77e68acea0c08a38
test_safekeeper_without_pageserver[release-pg17] / __arch: 'X64' / __lfc: 'without-lfc' / __sanitizers: 'disabled'

test_runner/regress/test_wal_acceptor.py:1519: in test_safekeeper_without_pageserver
    env.init()
test_runner/regress/test_wal_acceptor.py:1417: in init
    self.safekeepers.append(self.start_safekeeper(i))
test_runner/regress/test_wal_acceptor.py:1455: in start_safekeeper
    safekeeper_process = start_in_background(
test_runner/fixtures/utils.py:384: in start_in_background
    raise Exception(f"Failed to run subprocess as {command}, reason: {error}")
E   Exception: Failed to run subprocess as ['/tmp/neon/bin/safekeeper', '-l', '127.0.0.1:29028', '--listen-http', '127.0.0.1:29030', '-D', '/tmp/test_output/test_safekeeper_without_pageserver[release-pg17]-1/repo/sk1', '--id', '1', '--broker-endpoint', 'http://127.0.0.1:29027'], reason: Failed to get correct status from subprocess

safekeeper.log

2025-03-19T15:37:54.622664Z  INFO version: git-env:d72f1ecee91cfeafa6d11b349a53d5051b28377b
2025-03-19T15:37:54.622718Z  INFO buld_tag: build_tag-env:13949906475
2025-03-19T15:37:54.628513Z  INFO claimed pid file at "/tmp/test_output/test_safekeeper_without_pageserver[release-pg17]-1/repo/sk1/safekeeper.pid"
2025-03-19T15:37:54.629651Z  INFO initialized safekeeper id 1
2025-03-19T15:37:54.629679Z  INFO pg auth is disabled
2025-03-19T15:37:54.629684Z  INFO pg tenant only auth is disabled
2025-03-19T15:37:54.629691Z  INFO http auth is disabled
2025-03-19T15:37:54.629701Z  INFO no JWT token for authentication with safekeepers detected
2025-03-19T15:37:54.629732Z  INFO not initializing Sentry, no SENTRY_DSN given
--- EOF ---

The corresponding job:
https://github.com/neondatabase/neon/actions/runs/13949906475/job/39047159039

The log of build-and-test-locally (x64, release) / regress-tests (v17, without-lfc) contains:

2025-03-19T15:16:04.0587755Z Current runner version: '2.322.0'
2025-03-19T15:16:04.0592830Z Runner name: 'qemu-x64-21-24c40g-6'
...
2025-03-19T15:38:05.2652959Z test_runner/regress/test_wal_acceptor.py::test_safekeeper_without_pageserver[release-pg17]
2025-03-19T15:38:05.2656807Z [gw9] [ 91%] RERUN test_runner/regress/test_wal_acceptor.py::test_safekeeper_without_pageserver[release-pg17]
2025-03-19T15:38:05.9631428Z test_runner/regress/test_wal_acceptor.py::test_safekeeper_without_pageserver[release-pg17]
...
2025-03-19T15:38:12.8279817Z [gw9] [ 92%] PASSED test_runner/regress/test_wal_acceptor.py::test_safekeeper_without_pageserver[release-pg17]

safekeeper.log from the successful retry:

2025-03-19T15:38:05.263006Z  INFO version: git-env:d72f1ecee91cfeafa6d11b349a53d5051b28377b
2025-03-19T15:38:05.263072Z  INFO buld_tag: build_tag-env:13949906475
2025-03-19T15:38:05.396078Z  INFO claimed pid file at "/tmp/test_output/test_safekeeper_without_pageserver[release-pg17]-2/repo/sk1/safekeeper.pid"
2025-03-19T15:38:05.399170Z  INFO initialized safekeeper id 1
2025-03-19T15:38:05.399220Z  INFO pg auth is disabled
2025-03-19T15:38:05.399230Z  INFO pg tenant only auth is disabled
2025-03-19T15:38:05.399239Z  INFO http auth is disabled
2025-03-19T15:38:05.399250Z  INFO no JWT token for authentication with safekeepers detected
2025-03-19T15:38:05.399263Z  INFO not initializing Sentry, no SENTRY_DSN given
2025-03-19T15:38:09.954181Z  INFO syncfs data directory done elapsed_ms=4554

Note the duration of the syncfs operation.

The corresponding fragment of the start_in_background() function:

def start_in_background(
    command: list[str], cwd: Path, log_file_name: str, is_started: Callable[[], WaitUntilRet]
) -> subprocess.Popen[bytes]:
    """Starts a process, creates the logfile and redirects stderr and stdout there. Runs the start checks before the process is started, or errors."""

    log.info(f'Running command "{" ".join(command)}"')

    with open(cwd / log_file_name, "wb") as log_file:
...
                try:
                    wait_until(is_started, timeout=10)
                except Exception:
                    error = "Failed to get correct status from subprocess"
...

That is, the safekeeper process is expected to start within 10 seconds, but as we can see from the successful run, syncfs alone can take more than 4 seconds.

https://neon-github-public-dev.s3.amazonaws.com/reports/pr-11327/13989438844/index.html#testresult/2f826775f1405dea
test_safekeeper_without_pageserver[release-pg15] / __arch: 'X64' / __lfc: 'with-lfc' / __sanitizers: 'disabled'

E Exception: Failed to run subprocess as ['/tmp/neon/bin/safekeeper', '-l', '127.0.0.1:20232', '--listen-http', '127.0.0.1:20234', '-D', '/tmp/test_output/test_safekeeper_without_pageserver[release-pg15]-1/repo/sk1', '--id', '1', '--broker-endpoint', 'http://127.0.0.1:20231'], reason: Failed to get correct status from subprocess

safekeeper.log

2025-03-21T10:43:05.894048Z  INFO version: git-env:3e401e27159fde71068a46eac092391289d82485
2025-03-21T10:43:05.894078Z  INFO buld_tag: build_tag-env:13989438844
2025-03-21T10:43:05.896420Z  INFO claimed pid file at "/tmp/test_output/test_safekeeper_without_pageserver[release-pg15]-1/repo/sk1/safekeeper.pid"
2025-03-21T10:43:05.897352Z  INFO initialized safekeeper id 1
2025-03-21T10:43:05.897376Z  INFO pg auth is disabled
2025-03-21T10:43:05.897382Z  INFO pg tenant only auth is disabled
2025-03-21T10:43:05.897387Z  INFO http auth is disabled
2025-03-21T10:43:05.897395Z  INFO no JWT token for authentication with safekeepers detected
2025-03-21T10:43:05.897411Z  INFO not initializing Sentry, no SENTRY_DSN given
--- EOF ---

The corresponding job:
https://github.com/neondatabase/neon/actions/runs/13989438844/attempts/1

The log of build-and-test-locally (x64, release) / regress-tests (v15, with-lfc) contains:

2025-03-21T10:23:40.4378195Z Current runner version: '2.322.0'
2025-03-21T10:23:40.4383374Z Runner name: 'qemu-x64-21-24c40g-2'
...
2025-03-21T10:45:14.7839077Z 2025-03-21 10:43:15.929 ERROR [utils.py:382] Failed to get correct status from subprocess

safekeeper.log from the successful retry:

The successful retry:
2025-03-21T10:43:17.918195Z  INFO version: git-env:3e401e27159fde71068a46eac092391289d82485
2025-03-21T10:43:17.918231Z  INFO buld_tag: build_tag-env:13989438844
2025-03-21T10:43:18.524044Z  INFO claimed pid file at "/tmp/test_output/test_safekeeper_without_pageserver[release-pg15]-2/repo/sk1/safekeeper.pid"
2025-03-21T10:43:19.029492Z  INFO initialized safekeeper id 1
2025-03-21T10:43:19.029520Z  INFO pg auth is disabled
2025-03-21T10:43:19.029524Z  INFO pg tenant only auth is disabled
2025-03-21T10:43:19.029528Z  INFO http auth is disabled
2025-03-21T10:43:19.029544Z  INFO no JWT token for authentication with safekeepers detected
2025-03-21T10:43:19.029559Z  INFO not initializing Sentry, no SENTRY_DSN given
2025-03-21T10:43:19.998179Z  INFO syncfs data directory done elapsed_ms=968

The following info is minimized for brevity.

https://neon-github-public-dev.s3.amazonaws.com/reports/pr-11440/14249341894/index.html#testresult/a3046f8ed71139b4
test_safekeeper_without_pageserver[release-pg15] / __arch: 'X64' / __lfc: 'with-lfc' / __sanitizers: 'disabled'

E Exception: Failed to run subprocess as ['/tmp/neon/bin/safekeeper', '-l', '127.0.0.1:15654', '--listen-http', '127.0.0.1:15656', '-D', '/tmp/test_output/test_safekeeper_without_pageserver[release-pg15]-1/repo/sk1', '--id', '1', '--broker-endpoint', 'http://127.0.0.1:15653'], reason: Failed to get correct status from subprocess

https://github.com/neondatabase/neon/actions/runs/14249341894/attempts/1

2025-04-03T17:53:34.1032495Z Current runner version: '2.323.0'
2025-04-03T17:53:34.1038108Z Runner name: 'qemu-x64-21-24c40g-4'

...

The successful retry:

2025-04-03T18:16:44.353687Z  INFO syncfs data directory done elapsed_ms=2072

https://neon-github-public-dev.s3.amazonaws.com/reports/main/14471549380/index.html#testresult/6027f3f516dcac38
test_safekeeper_without_pageserver[release-pg17] / __arch: 'X64' / __lfc: 'without-lfc' / __sanitizers: 'disabled'

E Exception: Failed to run subprocess as ['/tmp/neon/bin/safekeeper', '-l', '127.0.0.1:15779', '--listen-http', '127.0.0.1:15781', '-D', '/tmp/test_output/test_safekeeper_without_pageserver[release-pg17]-1/repo/sk1', '--id', '1', '--broker-endpoint', 'http://127.0.0.1:15778'], reason: Failed to get correct status from subprocess

https://github.com/neondatabase/neon/actions/runs/14471549380/job/40587376882

2025-04-15T14:25:00.7087314Z Current runner version: '2.323.0'
2025-04-15T14:25:00.7092866Z Runner name: 'qemu-x64-21-24c40g-2'

The successful retry:

2025-04-15T14:53:51.091606Z  INFO syncfs data directory done elapsed_ms=7364

https://neon-github-public-dev.s3.amazonaws.com/reports/pr-11590/14472304502/index.html#testresult/93e32e269e3fd304
test_safekeeper_without_pageserver[release-pg14] / __arch: 'X64' / __lfc: 'with-lfc' / __sanitizers: 'disabled'

E Exception: Failed to run subprocess as ['/tmp/neon/bin/safekeeper', '-l', '127.0.0.1:18746', '--listen-http', '127.0.0.1:18748', '-D', '/tmp/test_output/test_safekeeper_without_pageserver[release-pg14]-1/repo/sk1', '--id', '1', '--broker-endpoint', 'http://127.0.0.1:18745'], reason: Failed to get correct status from subprocess

https://github.com/neondatabase/neon/actions/runs/14472304502/attempts/1

2025-04-15T15:19:39.1622181Z Current runner version: '2.323.0'
2025-04-15T15:19:39.1630124Z Runner name: 'qemu-x64-21-24c40g-2'

The successful retry:

2025-04-15T15:48:57.713833Z  INFO syncfs data directory done elapsed_ms=1900

https://neon-github-public-dev.s3.amazonaws.com/reports/pr-11590/14472304502/index.html#testresult/251fae5332827a17
test_safekeeper_without_pageserver[release-pg14] / __arch: 'X64' / __lfc: 'with-lfc' / __sanitizers: 'disabled'

E Exception: Failed to run subprocess as ['/tmp/neon/bin/safekeeper', '-l', '127.0.0.1:18750', '--listen-http', '127.0.0.1:18752', '-D', '/tmp/test_output/test_safekeeper_without_pageserver[release-pg14]-2/repo/sk1', '--id', '1', '--broker-endpoint', 'http://127.0.0.1:18749'], reason: Failed to get correct status from subprocess

log:

2025-04-15 15:48:34.620 INFO [utils.py:365] Running command "/tmp/neon/bin/safekeeper -l 127.0.0.1:18746 --listen-http 127.0.0.1:18748 -D /tmp/test_output/test_safekeeper_without_pageserver[release-pg14]-1/repo/sk1 --id 1 --broker-endpoint http://127.0.0.1:18745"
...
2025-04-15 15:48:43.664 INFO [utils.py:412] waiting for check_status: HTTPConnectionPool(host='localhost', port=18748): Max retries exceeded with url: /v1/status (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f20b07cefd0>: Failed to establish a new connection: [Errno 111] Connection refused'))
2025-04-15 15:48:44.667 ERROR [utils.py:383] Failed to get correct status from subprocess

safekeeper.log

2025-04-15T15:48:45.568619Z  INFO version: git-env:8952484b25f867ffdea6eb8a347d541e14facad6
...
2025-04-15T15:48:55.250253Z  INFO syncfs data directory done elapsed_ms=9054
2025-04-15T15:48:55.250294Z  INFO starting safekeeper WAL service on 127.0.0.1:18750
...
2025-04-15T15:48:55.594052Z  WARN broker: discover task failed: subscribe_by_filter request failed
--- EOF ---

Thus, even when syncfs takes 9 (<10) seconds, it might still not be enough for the test to get the subprocess status.

https://github.com/neondatabase/neon/actions/runs/14472304502/job/40591468794

2025-04-15T15:19:39.1622181Z Current runner version: '2.323.0'
2025-04-15T15:19:39.1630124Z Runner name: 'qemu-x64-21-24c40g-2'

The successful retry:

2025-04-15T15:48:57.713833Z  INFO syncfs data directory done elapsed_ms=1900

Note that for all the failures, I've analyzed, the tests were performed by qemu-x64-21-24c40g-x runners.

Moreover, it seems that when syncfs is so slow, it might also affect other tests. E.g., for the latter test suite run, we have also test_restart_endpoint_after_switch_wal[release-pg14], test_scrubber_physical_gc_ancestors_split[release-pg14] failed as flaky, and also some other non-pg14 ones.


When running this test on my laptop. even with 20 jobs in parallel:

DEFAULT_PG_VERSION=17 BUILD_TYPE=release poetry run pytest --count 20 -n20 test_runner/regress/test_wal_acceptor.py::test_safekeeper_without_pageserver >test.log

I'm getting a much lower duration :

grep -h -r 'syncfs data directory done elapsed_ms' test_output/
2025-04-17T09:27:09.094727Z  INFO syncfs data directory done elapsed_ms=10
2025-04-17T09:27:09.089638Z  INFO syncfs data directory done elapsed_ms=10
2025-04-17T09:27:09.089623Z  INFO syncfs data directory done elapsed_ms=10
2025-04-17T09:27:09.089648Z  INFO syncfs data directory done elapsed_ms=10
2025-04-17T09:27:09.089637Z  INFO syncfs data directory done elapsed_ms=10
2025-04-17T09:27:09.095284Z  INFO syncfs data directory done elapsed_ms=10
2025-04-17T09:27:09.089634Z  INFO syncfs data directory done elapsed_ms=10
2025-04-17T09:27:09.089620Z  INFO syncfs data directory done elapsed_ms=10
2025-04-17T09:27:09.089632Z  INFO syncfs data directory done elapsed_ms=10
2025-04-17T09:27:09.101060Z  INFO syncfs data directory done elapsed_ms=16
2025-04-17T09:27:09.095126Z  INFO syncfs data directory done elapsed_ms=10
2025-04-17T09:27:09.089629Z  INFO syncfs data directory done elapsed_ms=10
2025-04-17T09:27:09.095672Z  INFO syncfs data directory done elapsed_ms=11
2025-04-17T09:27:09.089633Z  INFO syncfs data directory done elapsed_ms=10
2025-04-17T09:27:09.089623Z  INFO syncfs data directory done elapsed_ms=10
2025-04-17T09:27:09.096047Z  INFO syncfs data directory done elapsed_ms=11
2025-04-17T09:27:09.095239Z  INFO syncfs data directory done elapsed_ms=10
2025-04-17T09:27:09.094849Z  INFO syncfs data directory done elapsed_ms=10
2025-04-17T09:27:09.095471Z  INFO syncfs data directory done elapsed_ms=11
2025-04-17T09:27:09.095177Z  INFO syncfs data directory done elapsed_ms=10

Metadata

Metadata

Assignees

No one assigned

    Type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions