-
Notifications
You must be signed in to change notification settings - Fork 853
Description
Multiple failures of test_safekeeper_without_pageserver (23 during the past 30 days) occurred because of errors like:
E Exception: Failed to run subprocess as ['/tmp/neon/bin/safekeeper', '-l', '127.0.0.1:18750', '--listen-http', '127.0.0.1:18752', '-D', '/tmp/test_output/test_safekeeper_without_pageserver[release-pg14]-2/repo/sk1', '--id', '1', '--broker-endpoint', 'http://127.0.0.1:18749'], \
reason: Failed to get correct status from subprocess
are caused by slow fsync on specific runner(s).
I've analyzed several ones out of 23 (all occurred on X64):
https://neon-github-public-dev.s3.amazonaws.com/reports/pr-11289/13949906475/index.html#testresult/77e68acea0c08a38
test_safekeeper_without_pageserver[release-pg17] / __arch: 'X64' / __lfc: 'without-lfc' / __sanitizers: 'disabled'
test_runner/regress/test_wal_acceptor.py:1519: in test_safekeeper_without_pageserver
env.init()
test_runner/regress/test_wal_acceptor.py:1417: in init
self.safekeepers.append(self.start_safekeeper(i))
test_runner/regress/test_wal_acceptor.py:1455: in start_safekeeper
safekeeper_process = start_in_background(
test_runner/fixtures/utils.py:384: in start_in_background
raise Exception(f"Failed to run subprocess as {command}, reason: {error}")
E Exception: Failed to run subprocess as ['/tmp/neon/bin/safekeeper', '-l', '127.0.0.1:29028', '--listen-http', '127.0.0.1:29030', '-D', '/tmp/test_output/test_safekeeper_without_pageserver[release-pg17]-1/repo/sk1', '--id', '1', '--broker-endpoint', 'http://127.0.0.1:29027'], reason: Failed to get correct status from subprocess
safekeeper.log
2025-03-19T15:37:54.622664Z INFO version: git-env:d72f1ecee91cfeafa6d11b349a53d5051b28377b
2025-03-19T15:37:54.622718Z INFO buld_tag: build_tag-env:13949906475
2025-03-19T15:37:54.628513Z INFO claimed pid file at "/tmp/test_output/test_safekeeper_without_pageserver[release-pg17]-1/repo/sk1/safekeeper.pid"
2025-03-19T15:37:54.629651Z INFO initialized safekeeper id 1
2025-03-19T15:37:54.629679Z INFO pg auth is disabled
2025-03-19T15:37:54.629684Z INFO pg tenant only auth is disabled
2025-03-19T15:37:54.629691Z INFO http auth is disabled
2025-03-19T15:37:54.629701Z INFO no JWT token for authentication with safekeepers detected
2025-03-19T15:37:54.629732Z INFO not initializing Sentry, no SENTRY_DSN given
--- EOF ---
The corresponding job:
https://github.com/neondatabase/neon/actions/runs/13949906475/job/39047159039
The log of build-and-test-locally (x64, release) / regress-tests (v17, without-lfc) contains:
2025-03-19T15:16:04.0587755Z Current runner version: '2.322.0'
2025-03-19T15:16:04.0592830Z Runner name: 'qemu-x64-21-24c40g-6'
...
2025-03-19T15:38:05.2652959Z test_runner/regress/test_wal_acceptor.py::test_safekeeper_without_pageserver[release-pg17]
2025-03-19T15:38:05.2656807Z [gw9] [ 91%] RERUN test_runner/regress/test_wal_acceptor.py::test_safekeeper_without_pageserver[release-pg17]
2025-03-19T15:38:05.9631428Z test_runner/regress/test_wal_acceptor.py::test_safekeeper_without_pageserver[release-pg17]
...
2025-03-19T15:38:12.8279817Z [gw9] [ 92%] PASSED test_runner/regress/test_wal_acceptor.py::test_safekeeper_without_pageserver[release-pg17]
safekeeper.log from the successful retry:
2025-03-19T15:38:05.263006Z INFO version: git-env:d72f1ecee91cfeafa6d11b349a53d5051b28377b
2025-03-19T15:38:05.263072Z INFO buld_tag: build_tag-env:13949906475
2025-03-19T15:38:05.396078Z INFO claimed pid file at "/tmp/test_output/test_safekeeper_without_pageserver[release-pg17]-2/repo/sk1/safekeeper.pid"
2025-03-19T15:38:05.399170Z INFO initialized safekeeper id 1
2025-03-19T15:38:05.399220Z INFO pg auth is disabled
2025-03-19T15:38:05.399230Z INFO pg tenant only auth is disabled
2025-03-19T15:38:05.399239Z INFO http auth is disabled
2025-03-19T15:38:05.399250Z INFO no JWT token for authentication with safekeepers detected
2025-03-19T15:38:05.399263Z INFO not initializing Sentry, no SENTRY_DSN given
2025-03-19T15:38:09.954181Z INFO syncfs data directory done elapsed_ms=4554
Note the duration of the syncfs operation.
The corresponding fragment of the start_in_background() function:
def start_in_background(
command: list[str], cwd: Path, log_file_name: str, is_started: Callable[[], WaitUntilRet]
) -> subprocess.Popen[bytes]:
"""Starts a process, creates the logfile and redirects stderr and stdout there. Runs the start checks before the process is started, or errors."""
log.info(f'Running command "{" ".join(command)}"')
with open(cwd / log_file_name, "wb") as log_file:
...
try:
wait_until(is_started, timeout=10)
except Exception:
error = "Failed to get correct status from subprocess"
...
That is, the safekeeper process is expected to start within 10 seconds, but as we can see from the successful run, syncfs alone can take more than 4 seconds.
https://neon-github-public-dev.s3.amazonaws.com/reports/pr-11327/13989438844/index.html#testresult/2f826775f1405dea
test_safekeeper_without_pageserver[release-pg15] / __arch: 'X64' / __lfc: 'with-lfc' / __sanitizers: 'disabled'
E Exception: Failed to run subprocess as ['/tmp/neon/bin/safekeeper', '-l', '127.0.0.1:20232', '--listen-http', '127.0.0.1:20234', '-D', '/tmp/test_output/test_safekeeper_without_pageserver[release-pg15]-1/repo/sk1', '--id', '1', '--broker-endpoint', 'http://127.0.0.1:20231'], reason: Failed to get correct status from subprocess
safekeeper.log
2025-03-21T10:43:05.894048Z INFO version: git-env:3e401e27159fde71068a46eac092391289d82485
2025-03-21T10:43:05.894078Z INFO buld_tag: build_tag-env:13989438844
2025-03-21T10:43:05.896420Z INFO claimed pid file at "/tmp/test_output/test_safekeeper_without_pageserver[release-pg15]-1/repo/sk1/safekeeper.pid"
2025-03-21T10:43:05.897352Z INFO initialized safekeeper id 1
2025-03-21T10:43:05.897376Z INFO pg auth is disabled
2025-03-21T10:43:05.897382Z INFO pg tenant only auth is disabled
2025-03-21T10:43:05.897387Z INFO http auth is disabled
2025-03-21T10:43:05.897395Z INFO no JWT token for authentication with safekeepers detected
2025-03-21T10:43:05.897411Z INFO not initializing Sentry, no SENTRY_DSN given
--- EOF ---
The corresponding job:
https://github.com/neondatabase/neon/actions/runs/13989438844/attempts/1
The log of build-and-test-locally (x64, release) / regress-tests (v15, with-lfc) contains:
2025-03-21T10:23:40.4378195Z Current runner version: '2.322.0'
2025-03-21T10:23:40.4383374Z Runner name: 'qemu-x64-21-24c40g-2'
...
2025-03-21T10:45:14.7839077Z 2025-03-21 10:43:15.929 ERROR [utils.py:382] Failed to get correct status from subprocess
safekeeper.log from the successful retry:
The successful retry:
2025-03-21T10:43:17.918195Z INFO version: git-env:3e401e27159fde71068a46eac092391289d82485
2025-03-21T10:43:17.918231Z INFO buld_tag: build_tag-env:13989438844
2025-03-21T10:43:18.524044Z INFO claimed pid file at "/tmp/test_output/test_safekeeper_without_pageserver[release-pg15]-2/repo/sk1/safekeeper.pid"
2025-03-21T10:43:19.029492Z INFO initialized safekeeper id 1
2025-03-21T10:43:19.029520Z INFO pg auth is disabled
2025-03-21T10:43:19.029524Z INFO pg tenant only auth is disabled
2025-03-21T10:43:19.029528Z INFO http auth is disabled
2025-03-21T10:43:19.029544Z INFO no JWT token for authentication with safekeepers detected
2025-03-21T10:43:19.029559Z INFO not initializing Sentry, no SENTRY_DSN given
2025-03-21T10:43:19.998179Z INFO syncfs data directory done elapsed_ms=968
The following info is minimized for brevity.
https://neon-github-public-dev.s3.amazonaws.com/reports/pr-11440/14249341894/index.html#testresult/a3046f8ed71139b4
test_safekeeper_without_pageserver[release-pg15] / __arch: 'X64' / __lfc: 'with-lfc' / __sanitizers: 'disabled'
E Exception: Failed to run subprocess as ['/tmp/neon/bin/safekeeper', '-l', '127.0.0.1:15654', '--listen-http', '127.0.0.1:15656', '-D', '/tmp/test_output/test_safekeeper_without_pageserver[release-pg15]-1/repo/sk1', '--id', '1', '--broker-endpoint', 'http://127.0.0.1:15653'], reason: Failed to get correct status from subprocess
https://github.com/neondatabase/neon/actions/runs/14249341894/attempts/1
2025-04-03T17:53:34.1032495Z Current runner version: '2.323.0'
2025-04-03T17:53:34.1038108Z Runner name: 'qemu-x64-21-24c40g-4'
...
The successful retry:
2025-04-03T18:16:44.353687Z INFO syncfs data directory done elapsed_ms=2072
https://neon-github-public-dev.s3.amazonaws.com/reports/main/14471549380/index.html#testresult/6027f3f516dcac38
test_safekeeper_without_pageserver[release-pg17] / __arch: 'X64' / __lfc: 'without-lfc' / __sanitizers: 'disabled'
E Exception: Failed to run subprocess as ['/tmp/neon/bin/safekeeper', '-l', '127.0.0.1:15779', '--listen-http', '127.0.0.1:15781', '-D', '/tmp/test_output/test_safekeeper_without_pageserver[release-pg17]-1/repo/sk1', '--id', '1', '--broker-endpoint', 'http://127.0.0.1:15778'], reason: Failed to get correct status from subprocess
https://github.com/neondatabase/neon/actions/runs/14471549380/job/40587376882
2025-04-15T14:25:00.7087314Z Current runner version: '2.323.0'
2025-04-15T14:25:00.7092866Z Runner name: 'qemu-x64-21-24c40g-2'
The successful retry:
2025-04-15T14:53:51.091606Z INFO syncfs data directory done elapsed_ms=7364
https://neon-github-public-dev.s3.amazonaws.com/reports/pr-11590/14472304502/index.html#testresult/93e32e269e3fd304
test_safekeeper_without_pageserver[release-pg14] / __arch: 'X64' / __lfc: 'with-lfc' / __sanitizers: 'disabled'
E Exception: Failed to run subprocess as ['/tmp/neon/bin/safekeeper', '-l', '127.0.0.1:18746', '--listen-http', '127.0.0.1:18748', '-D', '/tmp/test_output/test_safekeeper_without_pageserver[release-pg14]-1/repo/sk1', '--id', '1', '--broker-endpoint', 'http://127.0.0.1:18745'], reason: Failed to get correct status from subprocess
https://github.com/neondatabase/neon/actions/runs/14472304502/attempts/1
2025-04-15T15:19:39.1622181Z Current runner version: '2.323.0'
2025-04-15T15:19:39.1630124Z Runner name: 'qemu-x64-21-24c40g-2'
The successful retry:
2025-04-15T15:48:57.713833Z INFO syncfs data directory done elapsed_ms=1900
https://neon-github-public-dev.s3.amazonaws.com/reports/pr-11590/14472304502/index.html#testresult/251fae5332827a17
test_safekeeper_without_pageserver[release-pg14] / __arch: 'X64' / __lfc: 'with-lfc' / __sanitizers: 'disabled'
E Exception: Failed to run subprocess as ['/tmp/neon/bin/safekeeper', '-l', '127.0.0.1:18750', '--listen-http', '127.0.0.1:18752', '-D', '/tmp/test_output/test_safekeeper_without_pageserver[release-pg14]-2/repo/sk1', '--id', '1', '--broker-endpoint', 'http://127.0.0.1:18749'], reason: Failed to get correct status from subprocess
log:
2025-04-15 15:48:34.620 INFO [utils.py:365] Running command "/tmp/neon/bin/safekeeper -l 127.0.0.1:18746 --listen-http 127.0.0.1:18748 -D /tmp/test_output/test_safekeeper_without_pageserver[release-pg14]-1/repo/sk1 --id 1 --broker-endpoint http://127.0.0.1:18745"
...
2025-04-15 15:48:43.664 INFO [utils.py:412] waiting for check_status: HTTPConnectionPool(host='localhost', port=18748): Max retries exceeded with url: /v1/status (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f20b07cefd0>: Failed to establish a new connection: [Errno 111] Connection refused'))
2025-04-15 15:48:44.667 ERROR [utils.py:383] Failed to get correct status from subprocess
safekeeper.log
2025-04-15T15:48:45.568619Z INFO version: git-env:8952484b25f867ffdea6eb8a347d541e14facad6
...
2025-04-15T15:48:55.250253Z INFO syncfs data directory done elapsed_ms=9054
2025-04-15T15:48:55.250294Z INFO starting safekeeper WAL service on 127.0.0.1:18750
...
2025-04-15T15:48:55.594052Z WARN broker: discover task failed: subscribe_by_filter request failed
--- EOF ---
Thus, even when syncfs takes 9 (<10) seconds, it might still not be enough for the test to get the subprocess status.
https://github.com/neondatabase/neon/actions/runs/14472304502/job/40591468794
2025-04-15T15:19:39.1622181Z Current runner version: '2.323.0'
2025-04-15T15:19:39.1630124Z Runner name: 'qemu-x64-21-24c40g-2'
The successful retry:
2025-04-15T15:48:57.713833Z INFO syncfs data directory done elapsed_ms=1900
Note that for all the failures, I've analyzed, the tests were performed by qemu-x64-21-24c40g-x runners.
Moreover, it seems that when syncfs is so slow, it might also affect other tests. E.g., for the latter test suite run, we have also test_restart_endpoint_after_switch_wal[release-pg14], test_scrubber_physical_gc_ancestors_split[release-pg14] failed as flaky, and also some other non-pg14 ones.
When running this test on my laptop. even with 20 jobs in parallel:
DEFAULT_PG_VERSION=17 BUILD_TYPE=release poetry run pytest --count 20 -n20 test_runner/regress/test_wal_acceptor.py::test_safekeeper_without_pageserver >test.log
I'm getting a much lower duration :
grep -h -r 'syncfs data directory done elapsed_ms' test_output/
2025-04-17T09:27:09.094727Z INFO syncfs data directory done elapsed_ms=10
2025-04-17T09:27:09.089638Z INFO syncfs data directory done elapsed_ms=10
2025-04-17T09:27:09.089623Z INFO syncfs data directory done elapsed_ms=10
2025-04-17T09:27:09.089648Z INFO syncfs data directory done elapsed_ms=10
2025-04-17T09:27:09.089637Z INFO syncfs data directory done elapsed_ms=10
2025-04-17T09:27:09.095284Z INFO syncfs data directory done elapsed_ms=10
2025-04-17T09:27:09.089634Z INFO syncfs data directory done elapsed_ms=10
2025-04-17T09:27:09.089620Z INFO syncfs data directory done elapsed_ms=10
2025-04-17T09:27:09.089632Z INFO syncfs data directory done elapsed_ms=10
2025-04-17T09:27:09.101060Z INFO syncfs data directory done elapsed_ms=16
2025-04-17T09:27:09.095126Z INFO syncfs data directory done elapsed_ms=10
2025-04-17T09:27:09.089629Z INFO syncfs data directory done elapsed_ms=10
2025-04-17T09:27:09.095672Z INFO syncfs data directory done elapsed_ms=11
2025-04-17T09:27:09.089633Z INFO syncfs data directory done elapsed_ms=10
2025-04-17T09:27:09.089623Z INFO syncfs data directory done elapsed_ms=10
2025-04-17T09:27:09.096047Z INFO syncfs data directory done elapsed_ms=11
2025-04-17T09:27:09.095239Z INFO syncfs data directory done elapsed_ms=10
2025-04-17T09:27:09.094849Z INFO syncfs data directory done elapsed_ms=10
2025-04-17T09:27:09.095471Z INFO syncfs data directory done elapsed_ms=11
2025-04-17T09:27:09.095177Z INFO syncfs data directory done elapsed_ms=10