
Another attempt at making test_vm_bits less flaky #7989

Merged 4 commits into main on Jun 12, 2024

Conversation

hlinnaka
Contributor

@hlinnaka hlinnaka commented Jun 6, 2024

  • Split the first and second parts of the test to two separate tests

  • In the first test, disable the aggressive GC, compaction, and autovacuum. They are only needed by the second test. I'd like to get the first test to a point that the VM page is never all-zeros. Disabling autovacuum in the first test is hopefully enough to accomplish that.

  • Compare the full page images; don't skip the page header. After fixing the previous point, there should be no discrepancy. The LSN still won't match, though, because of commit 387a368.

Fixes issue #7984
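The full-page comparison with an LSN exception could be sketched as follows. This is a hypothetical helper, not the actual test code; it assumes the standard PostgreSQL 8 KiB page layout, where the first 8 bytes of the page header are the pd_lsn field:

```python
PAGE_SIZE = 8192
LSN_BYTES = 8  # pd_lsn occupies the first 8 bytes of the page header

def pages_equal_ignoring_lsn(page_a: bytes, page_b: bytes) -> bool:
    """Compare two full page images, masking only the pd_lsn field."""
    assert len(page_a) == PAGE_SIZE and len(page_b) == PAGE_SIZE
    return page_a[LSN_BYTES:] == page_b[LSN_BYTES:]
```

This way a stray difference anywhere else in the header or body fails the comparison, while LSN-only differences are tolerated.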


github-actions bot commented Jun 6, 2024

3198 tests run: 3059 passed, 0 failed, 139 skipped (full report)


Flaky tests (1)

Postgres 14

  • test_storage_controller_smoke: release

Code coverage* (full report)

  • functions: 31.5% (6604 of 20963 functions)
  • lines: 48.4% (51063 of 105401 lines)

* collected from Rust tests only


The comment gets automatically updated with the latest test results.
eacb09a at 2024-06-10T12:13:07.261Z

Contributor

@knizhnik knizhnik left a comment


I agree with your arguments about the source of the flakiness of this test,
and I hope that your refactoring of the test will fix it.

But I do not agree with this statement:

LSN still won't match, though, because of commit 387a368

The LSN does not match, but not because of [387a368]:
you excluded the LSN from the comparison in
[428d9fe], which was merged long before my PR.

Also, I am not sure that reverting my commit [b06eec4] (ignoring the page header in the comparison) will not resurrect the problem that commit was addressing: a zero page being compared with an empty page.

We have suspicious code in pagestore_smgr that treats an empty page as a zero page:

		else if (PageIsEmptyHeapPage((Page) buffer))
		{
			ereport(SmgrTrace,
					(errmsg(NEON_TAG "Page %u of relation %u/%u/%u.%u is an empty heap page with no LSN",
							blocknum,
							RelFileInfoFmt(InfoFromSMgrRel(reln)),
							forknum)));
		}

I do not remember why it was necessary. Maybe we should comment it out and check what breaks? Yes, in the case of the VM this branch should not normally be reachable, because we force WAL-logging of VM pages. But only when shutdown is not requested. If a VM page is flushed during shutdown, can the same thing (replacing an empty page with a zero page) also happen for a VM page?

Co-authored-by: Joonas Koivunen <joonas@neon.tech>
@hlinnaka
Contributor Author

hlinnaka commented Jun 7, 2024

But I do not agree with this statement:

LSN still won't match, though, because of commit 387a368

The LSN does not match, but not because of [387a368]: you excluded the LSN from the comparison in [428d9fe], which was merged long before my PR.

Yes, you're right. I thought autovacuum=off would make it deterministic, but the checkpointer or bgwriter might also decide to write it out.

Also, I am not sure that reverting my commit [b06eec4] (ignoring the page header in the comparison) will not resurrect the problem that commit was addressing: a zero page being compared with an empty page.

With autovacuum=off, I'm hoping that VACUUM FREEZE command will always succeed in freezing the page, and the VM bits are set, hence no empty page.
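For context, "the VM bits are set" refers to the two bits the visibility map keeps per heap page (all-visible and all-frozen). A rough Python sketch of that bit layout, modeled on PostgreSQL's visibilitymap.c; the 24-byte page-header offset and the constants are assumptions about the on-disk format, not code from this PR:

```python
VM_ALL_VISIBLE = 0x01
VM_ALL_FROZEN = 0x02
PAGE_HEADER = 24        # SizeOfPageHeaderData; VM bits start after the header
BITS_PER_HEAPBLOCK = 2  # all-visible + all-frozen
HEAPBLOCKS_PER_BYTE = 4  # 8 bits / 2 bits per heap block

def vm_bits(vm_page: bytes, heap_blk: int) -> tuple[bool, bool]:
    """Return (all_visible, all_frozen) for heap block `heap_blk`."""
    byte = vm_page[PAGE_HEADER + heap_blk // HEAPBLOCKS_PER_BYTE]
    shift = (heap_blk % HEAPBLOCKS_PER_BYTE) * BITS_PER_HEAPBLOCK
    flags = (byte >> shift) & 0x03
    return bool(flags & VM_ALL_VISIBLE), bool(flags & VM_ALL_FROZEN)
```

If VACUUM FREEZE succeeds, both bits end up set for the frozen heap page, so the VM page is non-empty and the comparison has something stable to check.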

We have suspicious code in pagestore_smgr which treats empty page as zero page:

		else if (PageIsEmptyHeapPage((Page) buffer))
		{
			ereport(SmgrTrace,
					(errmsg(NEON_TAG "Page %u of relation %u/%u/%u.%u is an empty heap page with no LSN",
							blocknum,
							RelFileInfoFmt(InfoFromSMgrRel(reln)),
							forknum)));
		}

I do not remember why it was necessary. Maybe we should comment it out and check what breaks? Yes, in the case of the VM this branch should not normally be reachable, because we force WAL-logging of VM pages. But only when shutdown is not requested. If a VM page is flushed during shutdown, can the same thing (replacing an empty page with a zero page) also happen for a VM page?

Yeah, we should also WAL-log VM pages when shutdown has been requested. And probably FSM pages too, although that's not required for correctness.

Contributor

@problame problame left a comment


The flakiness is still there, no?


diff --git a/test_runner/regress/test_vm_bits.py b/test_runner/regress/test_vm_bits.py
index 9783cc442..e8c05e064 100644
--- a/test_runner/regress/test_vm_bits.py
+++ b/test_runner/regress/test_vm_bits.py
@@ -1,4 +1,5 @@
 import time
+import pytest
 
 from fixtures.log_helper import log
 from fixtures.neon_fixtures import NeonEnv, NeonEnvBuilder, fork_at_current_lsn
@@ -8,6 +9,8 @@ from fixtures.neon_fixtures import NeonEnv, NeonEnvBuilder, fork_at_current_lsn
 # Test that the VM bit is cleared correctly at a HEAP_DELETE and
 # HEAP_UPDATE record.
 #
+
+@pytest.mark.skip(reason="not flaky")
 def test_vm_bit_clear(neon_simple_env: NeonEnv):
     env = neon_simple_env
 
@@ -119,6 +122,7 @@ def test_vm_bit_clear(neon_simple_env: NeonEnv):
 #
 # This is a repro for the bug fixed in commit 66fa176cc8.
 #
+@pytest.mark.repeat(1000)
 def test_vm_bit_clear_on_heap_lock_whitebox(neon_env_builder: NeonEnvBuilder):
     env = neon_env_builder.init_start()
 
@@ -172,6 +176,7 @@ def test_vm_bit_clear_on_heap_lock_whitebox(neon_env_builder: NeonEnvBuilder):
     assert vm_page_at_pageserver == vm_page_in_cache
 
 
+@pytest.mark.repeat(1000)
 def test_vm_bit_clear_on_heap_lock_blackbox(neon_env_builder: NeonEnvBuilder):
     """
     The previous test is enough to verify the bug that was fixed in

Run 4 tests concurrently with

BUILD_TYPE=DEBUG DEFAULT_PG_VERSION=16 ./scripts/pytest -n4 ./test_runner/regress/test_vm_bits.py

Failed after ca. 300 runs with:

>       assert vm_page_at_pageserver == vm_page_in_cache
E       AssertionError: assert '000000000000...0000000000000' == '000000001800...0000000000000'
E         - 0000000018000020002004200000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
E         ? -----------------------
E         + 0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
E         ?                                                                                                                                                                  +++++++++++++++++++++++

test_runner/regress/test_vm_bits.py:176: AssertionError

failure.tar.gz

@hlinnaka
Contributor Author

hlinnaka commented Jun 10, 2024

The flakiness is still there, no?

Hmm, ok, thanks for the tip on how to reproduce it.

So who's holding back the xmin horizon? There shouldn't be any other concurrent transactions running.

Turns out there is: the activity monitor that's part of compute_ctl periodically runs queries to get the status of replication and running queries. If that runs concurrently with the INSERT and VACUUM steps in the test, the VACUUM cannot freeze the just-inserted tuples. This makes it happen much more frequently:

diff --git a/compute_tools/src/monitor.rs b/compute_tools/src/monitor.rs
index 872a3f775..fdaa7d488 100644
--- a/compute_tools/src/monitor.rs
+++ b/compute_tools/src/monitor.rs
@@ -118,7 +118,7 @@ fn watch_compute_activity(compute: &ComputeNode) {
                 //
                 // walproposer doesn't currently show up in pg_stat_replication,
                 // but protect if it will be
-                let ws_count_query = "select count(*) from pg_stat_replication where application_name != 'walproposer';";
+                let ws_count_query = "select count(*), pg_sleep(5) from pg_stat_replication where application_name != 'walproposer';";
                 match cli.query_one(ws_count_query, &[]) {
                     Ok(r) => match r.try_get::<&str, i64>("count") {
                         Ok(num_ws) => {

To work around that, I made the test run in a separate database. That way, the background activity in the 'postgres' database doesn't hold back the xmin horizon in the test database.
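The effect of moving the test into its own database can be illustrated with a toy model of the per-database xmin horizon. This is a deliberate simplification (the database name and backend list are hypothetical): for vacuuming a non-shared table, only snapshots held by backends connected to the same database hold back the horizon:

```python
def xmin_horizon(backends, dbname, next_xid):
    """Oldest xmin VACUUM in `dbname` must preserve (simplified model)."""
    xmins = [b["xmin"] for b in backends
             if b["db"] == dbname and b["xmin"] is not None]
    return min(xmins, default=next_xid)

# The compute_ctl activity monitor connects to 'postgres'; once the test
# runs in its own database, the monitor's snapshot no longer matters there.
backends = [
    {"db": "postgres", "xmin": 100},       # hypothetical monitor query
    {"db": "test_vm_bits", "xmin": None},  # idle test connection
]
```

With the test in 'test_vm_bits', the horizon there advances freely, so VACUUM FREEZE can freeze the just-inserted tuples even while the monitor query is running.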

With that, I've run about 600 iterations on my laptop now, with no failures.

Contributor

@problame problame left a comment


Can confirm that with your changes the flakiness is gone 🥳

@hlinnaka hlinnaka merged commit 9983ae2 into main Jun 12, 2024
64 checks passed
@hlinnaka hlinnaka deleted the test_vm_bits-less-flaky branch June 12, 2024 06:18
VladLazar pushed a commit that referenced this pull request Jun 12, 2024