New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix PANIC error in ALTER TABLE xxx EXPAND PARTITION PREPARE #12935
Fix PANIC error in ALTER TABLE xxx EXPAND PARTITION PREPARE #12935
Conversation
LGTM. Good job!
|
We can though not sure is necessary. We should try to capture most of context in commit message, change set and test associated with it. I understand many times github issue exists or mailing list discussion exist which has lot of context and referring it in code or commit message helps. In this case none exist and just to capture the context opening the issue while fixing seems overwork. Instead in this case can easily capture in commit context itself.
Interesting, yes need to think, I fell chances of failure with asserts enabled will be higher as |
{ | ||
GpPolicy *new_policy; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this codes are much wordy, but if we extraction this logic to a function, it's a bit weird:
/*
* Change num segments of given relation, and make the policy between on-disk
* catalog and on-memory relation cache consistently.
*/
void
GpPolicyReplaceNumSegments(Relation rel, const int new_numsegments)
{
GpPolicy *new_policy;
MemoryContext oldcontext;
oldcontext = MemoryContextSwitchTo(GetMemoryChunkContext(rel));
new_policy = GpPolicyCopy(rel->rd_cdbpolicy);
new_policy->numsegments = new_numsegments;
GpPolicyReplace(RelationGetRelid(rel), new_policy);
/* We should make the policy between on-disk catalog and on-memory relation cache consistently */
rel->rd_cdbpolicy = new_policy;
MemoryContextSwitchTo(oldcontext);
}
It is not at the same level as GpPolicyReplace
, just an encapsulation for it. So I think it's not appropriate to extraction the repeated code to a single function, which will be used rarely.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree its okay to not extract the same in separate function
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall LGTM, just need to address that last comment
{ | ||
GpPolicy *new_policy; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree its okay to not extract the same in separate function
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 Thanks for interactively incorporating the review comments
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM.
When merging this, please reword the commit message a bit. Might consider adding why with-cassert not produce the issue?
Thanks for you contribution!
@kainwen Actually, the version build with |
The PANIC issue is bringed in by commit: Expand partition table leaves in parallel., here's the setp to reproduce it:
If we exec
ALTER TABLE partition_test EXPAND PARTITION PREPARE
in a transaction and have an access to it next, sunchselect * from partition_test
orselect * from gp_distribution_policy
, QD or QE will PANIC cause wrongrelation->rd_cdbpolicy
.The root cause is that we used GpPolicy data that may become invalid in function
ATExecExpandPartitionTablePrepare
:In fact, the
ALTER TABLE
command will invokeATExecCmd()
, which will callCommandCounterIncrement()
to increase command numbers and invalid relation cache entry byCommandEndInvalidationMessages()
, which invokeRelationClearRelation()
eventually. The functionRelationClearRelation()
will physically blow away a relation cache entry, or reset it and rebuild it from scratch, also include the GpPolicy of a relation.And in the inner of
GpPolicyReplace()
, it'll invalid the relation cache entry byCatalogTupleUpdate
by send a message using SI (Share Invalid), which will be processed byProcessInvalidationMessages
inCommandCounterIncrement()
.Since the relation in
ATExecExpandPartitionTablePrepare
is fetched byrelation_open()
, which could be a cache result in relation cache. If we update therd_cdbpolicy
, it could update the relation cache entry'srd_cdbpolicy
to the new one, which not we wanted.If the relation's GpPolicy is pointer to new policy,
RelationClearRelation()
will not swap the GpPolicy with a new relation entry, it still pointer to old memory address. And then, the release of memory will cause a null pointer exception.