
Fix PANIC error in ALTER TABLE xxx EXPAND PARTITION PREPARE #12935

Merged

Conversation

SmartKeyerror (Contributor) commented:

The PANIC issue was introduced by commit: Expand partition table leaves in parallel. Here are the steps to reproduce it:

  1. Build gpdb without cassert, for example:
./configure --with-perl --with-python --with-libxml --with-gssapi --enable-debug --disable-cassert --prefix=/usr/local/gpdb
  2. Run these commands:
postgres=# create extension if not exists gp_debug_numsegments;
NOTICE:  extension "gp_debug_numsegments" already exists, skipping
CREATE EXTENSION
postgres=# select gp_debug_set_create_table_default_numsegments(1);
 gp_debug_set_create_table_default_numsegments 
-----------------------------------------------
 1
(1 row)

postgres=# begin;
BEGIN
postgres=# ALTER TABLE partition_test EXPAND PARTITION PREPARE;
ALTER TABLE
postgres=# select * from partition_test;
server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.
The connection to the server was lost. Attempting reset: Succeeded.
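Note: the transcript assumes a partitioned table named partition_test already exists, created while the single-segment default was active; its definition is not included in this PR. A minimal hypothetical definition (column names and partition bounds invented for illustration) could be:

-- Hypothetical repro table; the actual definition is not part of this PR.
-- Create it after gp_debug_set_create_table_default_numsegments(1) so the
-- table initially lives on a single segment and needs expansion.
create table partition_test (id int, val int)
distributed by (id)
partition by range (val) (start (0) end (10) every (5));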

If we execute ALTER TABLE partition_test EXPAND PARTITION PREPARE in a transaction and then access the table, for example with select * from partition_test or select * from gp_distribution_policy, the QD or QE will PANIC because of a wrong relation->rd_cdbpolicy.

The root cause is that ATExecExpandPartitionTablePrepare uses GpPolicy data that may become invalid:

static void
ATExecExpandPartitionTablePrepare(Relation rel)
{
	/* ...... */
	if (GpPolicyIsRandomPartitioned(rel_dist) || rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
	{
		GpPolicyReplace(relid, root_dist);
		rel->rd_cdbpolicy = root_dist;	/* here */
	}
}

In fact, the ALTER TABLE command invokes ATExecCmd(), which calls CommandCounterIncrement() to increment the command counter and invalidate relation cache entries via CommandEndInvalidationMessages(), which eventually invokes RelationClearRelation(). RelationClearRelation() either physically blows away a relation cache entry or resets it and rebuilds it from scratch, including the relation's GpPolicy.
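To make that hazard concrete, here is a small self-contained toy in plain C (not PostgreSQL code; every name below is invented for illustration) showing how a pointer stashed across a cache-entry rebuild ends up dangling:

#include <stdio.h>
#include <stdlib.h>

/* Stand-ins for GpPolicy and a relation cache entry. */
typedef struct Policy { int numsegments; } Policy;
typedef struct CacheEntry { Policy *policy; } CacheEntry;

/* Like RelationClearRelation(): blow the entry away and rebuild from scratch. */
static CacheEntry *
rebuild_entry(CacheEntry *old)
{
	CacheEntry *fresh = malloc(sizeof(CacheEntry));

	fresh->policy = malloc(sizeof(Policy));
	fresh->policy->numsegments = 1;		/* re-read from the "catalog" */
	free(old->policy);			/* old policy memory is released */
	free(old);
	return fresh;
}

int
main(void)
{
	CacheEntry *entry = malloc(sizeof(CacheEntry));

	entry->policy = malloc(sizeof(Policy));
	entry->policy->numsegments = 1;

	Policy *stashed = entry->policy;	/* pointer saved across invalidation */

	entry = rebuild_entry(entry);		/* invalidation + rebuild happens */

	/* Use-after-free: this is the bug; it may crash or print garbage. */
	printf("%d\n", stashed->numsegments);

	free(entry->policy);
	free(entry);
	return 0;
}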

Internally, GpPolicyReplace() invalidates the relation cache entry through CatalogTupleUpdate(), which sends a message via SI (shared invalidation); that message is processed by ProcessInvalidationMessages() in CommandCounterIncrement().

The relation in ATExecExpandPartitionTablePrepare is fetched with relation_open(), so it may be a cached entry from the relation cache. If we assign rd_cdbpolicy directly, we overwrite the cached entry's rd_cdbpolicy with the new pointer, which is not what we want.

If the relation's GpPolicy points to the new policy, RelationClearRelation() will not swap in a fresh GpPolicy when it rebuilds the entry; the entry still points to the old memory address. Then, when that memory is released, dereferencing the stale pointer causes the crash.
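A minimal sketch of the kind of fix, assuming it follows the same pattern as the helper discussed in the review threads below (copy the policy into the relcache entry's own memory context before caching it); this is not necessarily the exact merged diff:

static void
ATExecExpandPartitionTablePrepare(Relation rel)
{
	/* ...... */
	if (GpPolicyIsRandomPartitioned(rel_dist) || rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
	{
		MemoryContext oldcontext;

		GpPolicyReplace(relid, root_dist);

		/*
		 * Keep the in-memory policy consistent with the on-disk catalog:
		 * cache a copy allocated in the relcache entry's own context, so
		 * the pointer stays valid across invalidation and rebuild.
		 */
		oldcontext = MemoryContextSwitchTo(GetMemoryChunkContext(rel));
		rel->rd_cdbpolicy = GpPolicyCopy(root_dist);
		MemoryContextSwitchTo(oldcontext);
	}
}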

SmartKeyerror changed the title Fix PANIC error in ALTER TABLE xxx EXPAND PARTITION PREPARE to [WIP] Fix PANIC error in ALTER TABLE xxx EXPAND PARTITION PREPARE on Dec 15, 2021
kainwen (Contributor) commented on Dec 15, 2021:

LGTM.

Good job!

  1. Please create a GitHub issue to record the details.
  2. Reword the commit message a bit and add comments in the code.
  3. I think it is a good chance for you to find out why the issue is not (or not easily) reproducible with cassert enabled.

ashwinstar (Contributor) commented:

> Please create a GitHub issue to record the details.

We can, though I'm not sure it's necessary. We should try to capture most of the context in the commit message, the change set, and the tests associated with it. I understand that many times a GitHub issue or mailing-list discussion already exists with a lot of context, and referring to it in code or in the commit message helps. In this case neither exists, and opening an issue just to capture the context while fixing it seems like overwork; we can easily capture the context in the commit itself.

> I think it is a good chance for you to find out why the issue is not (or not easily) reproducible with cassert enabled.

Interesting, yes, I need to think about it. I feel the chances of failure with asserts enabled should be higher, since CLOBBER_FREED_MEMORY and RELCACHE_FORCE_RELEASE are defined when asserts are enabled.
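For context, here is a tiny runnable illustration (plain C, not PostgreSQL source) of why clobbering freed memory, as CLOBBER_FREED_MEMORY does via PostgreSQL's debug memory machinery, turns a silent use-after-free into an obvious failure:

#include <stdio.h>
#include <string.h>

int
main(void)
{
	int	buf[4] = {1, 2, 3, 4};

	/* Simulate a cassert-style free: overwrite the memory with 0x7F bytes. */
	memset(buf, 0x7F, sizeof(buf));

	/* A later read through a stale pointer now sees obvious garbage. */
	printf("%d\n", buf[0]);		/* prints 2139062143 (0x7F7F7F7F) */
	return 0;
}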

SmartKeyerror changed the title [WIP] Fix PANIC error in ALTER TABLE xxx EXPAND PARTITION PREPARE to Fix PANIC error in ALTER TABLE xxx EXPAND PARTITION PREPARE on Dec 20, 2021
SmartKeyerror (Contributor, Author) commented on src/backend/commands/tablecmds.c:
I think this code is rather wordy, but if we extract this logic into a function, it's a bit weird:

/*
 * Change the number of segments of the given relation, and keep the policy
 * consistent between the on-disk catalog and the in-memory relation cache.
 */
void
GpPolicyReplaceNumSegments(Relation rel, const int new_numsegments)
{
	GpPolicy	 *new_policy;
	MemoryContext oldcontext;

	oldcontext = MemoryContextSwitchTo(GetMemoryChunkContext(rel));

	new_policy = GpPolicyCopy(rel->rd_cdbpolicy);
	new_policy->numsegments = new_numsegments;

	GpPolicyReplace(RelationGetRelid(rel), new_policy);

	/* Keep the on-disk catalog and the in-memory relation cache consistent */
	rel->rd_cdbpolicy = new_policy;
	MemoryContextSwitchTo(oldcontext);
}

It is not at the same level as GpPolicyReplace, just an encapsulation of it. So I think it's not appropriate to extract the repeated code into a single function that would rarely be used.
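For what it's worth, if the helper were adopted, each call site would reduce to something like this (getgpsegmentCount() used here as a plausible source of the new segment count; treat the call site itself as hypothetical):

/* Hypothetical call site if GpPolicyReplaceNumSegments() were extracted */
GpPolicyReplaceNumSegments(rel, getgpsegmentCount());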

A reviewer (Contributor) replied:

Agree, it's okay to not extract this into a separate function.

ashwinstar (Contributor) left a review:

Overall LGTM, just need to address that last comment.


ashwinstar (Contributor) left a review:

👍 Thanks for interactively incorporating the review comments.

kainwen (Contributor) left a review:

LGTM.

When merging this, please reword the commit message a bit. Might consider adding why with-cassert does not reproduce the issue?

Thanks for your contribution!

SmartKeyerror (Contributor, Author) replied:

@kainwen Actually, a build with --enable-cassert can also reproduce this problem using the example I posted in this PR; it just did not fail in the pipeline.

kainwen closed this on Dec 24, 2021
kainwen reopened this on Dec 24, 2021
SmartKeyerror merged commit c3cd202 into greenplum-db:master on Dec 24, 2021
SmartKeyerror deleted the fix_gpexpand_pipeline branch on March 8, 2022