[6X FIX] Fix a bug of concurrently executing UPDATE/DELETE with VACUUM on a partitioned AO table #11895
Conversation
There are no AO tables in upstream PostgreSQL, but there are partitioned tables. How does the locking work for partitioned tables in upstream? It will be a good reference point in order to evaluate this change.
Hi @asimrp For UPDATE/DELETE on the root table, upstream locks all the leaves. For INSERT, upstream's behavior is discussed in this thread: https://groups.google.com/a/greenplum.org/g/gpdb-dev/c/wAPKpJzhbpM/m/VEkOeMf_BQAJ
src/backend/parser/parse_clause.c
Outdated
pstate->p_target_relation = parserOpenTable(pstate, relation, RowExclusiveLock, NULL);
Oid relid = RangeVarGetRelid(relation, NoLock, true);
We can get the relid from pstate->p_target_relation; there is no need to get it from the RangeVar.
src/backend/utils/cache/plancache.c
Outdated
if (parsetree->commandType == CMD_INSERT)
    lockmode = RowExclusiveLock;
else
    lockmode = ExclusiveLock;
I think the lockmode should also be deduced here.
> I think the lockmode should also be deduced here.

Yeah. We can just use lockmode here. That's easier.
It seems there was consensus to adopt the locking strategy from upstream (lock all leaves): https://groups.google.com/a/greenplum.org/g/gpdb-dev/c/-FELESD8BCc/m/W-wWpb7UAgAJ Is it possible to do the same on 6X_STABLE? Or would such a change be too intrusive for a stable branch?
Hi @asimrp , 6X's behavior is incorrect, which leads to this bug. Without GDD, 6X locks all the leaves for INSERT, DELETE, and UPDATE; that is the same behavior as this commit. Also, the previous fix forgot to modify the related code for cached plans. With GDD, 6X's behavior is very strange:
This commit keeps the same behavior as upstream (also the same as 6X without GDD, but done correctly). The only thing that might need discussion is: with GDD, do we still need to lock all leaves on the QD? I tried to find a bad case (without locking), but no result yet. cc @d
src/backend/parser/parse_clause.c
Outdated
    lockmode = ExclusiveLock;
else
    lockmode = RowExclusiveLock;
if (pstate->p_is_insert && gp_enable_global_deadlock_detector) {
Better to comment more here, e.g. move the comments removed from InitPlan() by this PR to here.
 */
if (rel_is_partitioned(rte->relid) && (parsetree->commandType == CMD_INSERT
        || parsetree->commandType == CMD_UPDATE || parsetree->commandType == CMD_DELETE))
    find_all_inheritors(rte->relid, lockmode, NULL);
Here the behavior is not the same as setTargetTable; we should keep them consistent.
> Here the behavior is not the same as setTargetTable; we should keep them consistent.

You mean adding the conditional statement for GDD, or removing commandType, or both?
Please note the usage of the function rel_is_partitioned. To recognize a partitioned table, see this example:
gpdb/src/backend/commands/indexcmds.c
Lines 2400 to 2409 in 183f886
if (rel_is_partitioned(heapOid))
{
    PartitionNode *pn;
    pn = get_parts(heapOid, 0 /* level */, 0 /* parent */, false /* inctemplate */,
                   true /* includesubparts */);
    prels = all_partition_relids(pn);
}
else if (rel_is_child_partition(heapOid))
    prels = find_all_inheritors(heapOid, NoLock, NULL);
pstate->p_target_relation = parserOpenTable(pstate, relation, RowExclusiveLock, NULL);
Oid relid = pstate->p_target_relation->rd_id;
if (rel_is_partitioned(relid))
rel_is_partitioned only recognizes the top relation of a partitioned table. If the relation is a middle-layer partitioned table, it will return false and we will skip locking its children.
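To illustrate the difference in traversal semantics, here is a hypothetical Python sketch (not the actual C code; the table names and the `children` mapping are made up) of a find_all_inheritors-style walk, which picks up middle-layer partitioned tables as well as leaves:

```python
from collections import deque

def find_all_inheritors(children, root):
    """Breadth-first walk of an inheritance tree: returns the root and
    every descendant, including middle-layer partitioned tables."""
    result, queue = [root], deque([root])
    while queue:
        rel = queue.popleft()
        for child in children.get(rel, ()):
            result.append(child)
            queue.append(child)
    return result

# Made-up three-level hierarchy: root -> middle layer -> leaves.
tree = {
    "sales": ["sales_2021", "sales_2022"],
    "sales_2021": ["sales_2021_q1", "sales_2021_q2"],
}
print(find_all_inheritors(tree, "sales"))
# → ['sales', 'sales_2021', 'sales_2022', 'sales_2021_q1', 'sales_2021_q2']
```

A check that only inspects the top-level parent (as rel_is_partitioned does) would stop at `sales` and never visit `sales_2021`'s children.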
 * If the relation is partitioned, we should acquire corresponding locks
 * of each leaf partition before entering InitPlan.
 */
if (rel_is_partitioned(rte->relid) && ((parsetree->commandType == CMD_INSERT
Same rel_is_partitioned issue here.
And the indentation here looks really confusing.
@@ -0,0 +1,66 @@
create extension if not exists gp_inject_fault;

create table sales_row (id int, date date, amt decimal(10,2))
Please also add tests for the multi-level partitioned table, and operate on the middle-layer partitioned table.
For a multi-level partitioned table, we can't insert/update/delete on a middle-layer table, so we don't need to add it.
Do we know why we can't do it? As I checked, upstream allows insert/update/delete on a middle-layer partitioned table.
I don't know, so tomorrow I will check the reason.
Can you also explain why this problem does not occur with concurrent vacuum and DML operations on a heap partitioned table? Why is it specific to AO?
 * This will cause a deadlock.
 * A similar scenario applies to UPDATE as well.
 * We already get all the locks in the parse-analyze stage,
 * so we don't need to acquire any locks here.
Well, this comment is wrong. Not all locks are acquired at the parse-analyze stage, see line 1742 above. The change introduced by this patch would cause locks to be acquired in InitPlan(), which is invoked during execution. Am I missing something?
> Well, this comment is wrong. Not all locks are acquired at the parse-analyze stage, see line 1742 above. The change introduced by this patch would cause locks to be acquired in InitPlan(), which is invoked during execution. Am I missing something?

Hi @asimrp InitPlan():1742 is not the first time the relation is locked. It just opens the table to get the result relation.
}
else if (pstate->p_is_insert && gp_enable_global_deadlock_detector)
{
    lockmode = NoLock;
This looks tricky. Not acquiring locks and yet accessing the tables sounds very incorrect. Can you please answer the following questions:
- Why is it correct to insert into a child table without acquiring a lock on it?
- What do we gain by not acquiring a lock?
> This looks tricky. Not acquiring locks and yet accessing the tables sounds very incorrect. Can you please answer the following questions:
> - Why is it correct to insert into a child table without acquiring a lock on it?
> - What do we gain by not acquiring a lock?
Hi @asimrp just sharing what is in my mind:
- this is the same behavior as 6X today (without this patch)
- INSERT on the root partition will lock the needed leaves during the executor (on the QEs)

Then for your questions:

> Why is it correct to insert into a child table without acquiring a lock on it?

It will be locked on the QEs, not the QD, which is roughly the same behavior as upstream. I spent a lot of time trying to come up with a bad case, but it seems OK.

> What do we gain by not acquiring a lock?

Maybe performance? See the thread: https://groups.google.com/a/greenplum.org/g/gpdb-dev/c/wAPKpJzhbpM/m/VEkOeMf_BQAJ

For 6X, keeping the same behavior in this patch (but correct) seems the safest way?
 * following deadlock scenario may happen between INSERT and AppendOnly VACUUM
 * drop phase on the partition table:
 *
 * 1. AO VACUUM drop on QD: acquired AccessExclusiveLock
It is important to mention which command operates on a leaf partition and which one on the root in this example. My understanding is that vacuum operates on a leaf, whereas insert operates on the root.
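For illustration, the four-step scenario in the quoted comment forms a cycle in the distributed wait-for graph, which is exactly the kind of cycle a global deadlock detector looks for. A toy Python sketch (the session names and the graph shape are made up to mirror the scenario; this is not GDD code):

```python
def find_cycle(wait_for):
    """Return one cycle in a wait-for graph (dict: node -> list of nodes
    it is waiting on), or None if the graph is deadlock-free."""
    def dfs(node, path, seen):
        if node in path:               # revisited a node on the current path: cycle
            return path[path.index(node):]
        if node in seen:
            return None
        seen.add(node)
        for nxt in wait_for.get(node, ()):
            cycle = dfs(nxt, path + [node], seen)
            if cycle:
                return cycle
        return None
    for start in wait_for:
        cycle = dfs(start, [], set())
        if cycle:
            return cycle
    return None

# Mirroring steps 1-4: the local lock manager on the QD or on the QE each
# sees only half of the cycle, so neither can resolve it alone.
graph = {
    "VACUUM(QE)": ["INSERT(QE)"],  # step 3 waits on step 2's RowExclusiveLock
    "INSERT(QD)": ["VACUUM(QD)"],  # step 4 waits on step 1's AccessExclusiveLock
    "INSERT(QE)": ["INSERT(QD)"],  # same distributed transaction
    "VACUUM(QD)": ["VACUUM(QE)"],  # same distributed transaction
}
print(find_cycle(graph))
```

Locking the partition tables on the QD before step 2 (or enabling GDD) is what breaks this cycle.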
-- session 1 gets locks for sales_row_1_prt_1
-- session 2 will get a snapshot and enter InitPlan, waiting for session 1 to release the locks for sales_row_1_prt_1
-- session 1 completes
-- session 2 gets the locks, but the snapshot is invalid.
A better way to say this: "the version of tuple (in aoseg relation?) created by vacuum is invisible to session2's snapshot". The term "invalid snapshot" does not ring the right bells.
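To make the ordering point concrete, here is a toy Python model (entirely made up, not GPDB code) of why waiting for a lock after the snapshot is taken lets VACUUM commit a newer metadata version that the snapshot cannot see:

```python
class Table:
    """Toy stand-in for a leaf partition's aoseg metadata: tracks only
    the xid of the newest committed version."""
    def __init__(self):
        self.committed_xid = 0

    def vacuum(self, xid):
        self.committed_xid = xid  # VACUUM commits a new metadata version

def run_dml(table, take_snapshot_first):
    if take_snapshot_first:
        snapshot = table.committed_xid  # snapshot taken before waiting...
        table.vacuum(xid=42)            # ...VACUUM commits while we block on its lock
    else:
        table.vacuum(xid=42)            # lock acquired first: VACUUM already finished
        snapshot = table.committed_xid  # snapshot sees the latest version
    # If the snapshot lags the committed state, the DML reads a version of
    # the tuple that VACUUM has already superseded.
    return snapshot == table.committed_xid

print(run_dml(Table(), take_snapshot_first=True))   # → False (stale snapshot)
print(run_dml(Table(), take_snapshot_first=False))  # → True (consistent)
```

This matches the fix's approach: acquire all partition locks before the snapshot is dispatched, so the second ordering always holds.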
LGTM. Please resolve comments ASAP.
👍 to merge this fix as it targets a specific problem. Bigger questions with regards to the locking of partitioned tables with and without GDD enabled remain but they should be dealt with separately.
@bbbearxyz @wraymo before merging, please answer the question on multilevel partitioning raised by Junfeng and also incorporate the specific feedback already given.
The bigger problems are being tracked in #8362.
I think we have already solved the problem in this PR, because we take locks on each leaf partition before InitPlan, following the lock mode of its root partition.
I will get to reviewing this PR tomorrow. Meanwhile, let's add a README file to document the current partition locking semantics, based on the knowledge we have gained as part of this PR.
-- session 2 will get locks on sales_row, but still waits for locks on the leaf partition in the function AcquirePlannerLocks.
2&: execute test;

select pg_sleep(2);
Why do we have this sleep in the test?
We must make sure session 2 arrives at the place where it needs to take locks. If we don't sleep, this situation may happen: session 1 waits in the fault vacuum_hold_lock, session 2 is too slow and doesn't arrive at the lock-taking place before the fault is reset; session 1 completes and then session 2 completes serially, and the test is not valid.
If that's the case, there is no way to guarantee they run concurrently even with sleep, as it is completely non-deterministic and times can vary too much in CI. Can we enhance it and make it deterministic instead, by looping and checking that the required lock-waiting stage has been reached by querying pg_locks?
Suppose this situation: a parent table has two child tables, a and b. When we want to insert into the parent table, we need to acquire the parent lock and the two child locks. But in the test, the vacuum in session 1 already holds a lock on b, and we must make sure session 2 waits at the position where it needs the lock on b. We can't guarantee that: although we can check pg_locks to see that session 2 already holds the lock on a, we still can't make sure session 2 is waiting at the position of taking the lock on b. It is still non-deterministic, although the window is very small. Do you think it is better or not?
I already know how to solve it. Thank you! I will use fault injection to replace the sleep.
I inject a fault at the place where the lock is taken, but we still have the problem we just discussed: we can't make sure session 2 is waiting for the child locks exactly. So I use both fault injection and sleep to keep the non-determinism as small as possible.
The PR states it aligns locking behavior with upstream. That's stated in the context of which locks are acquired. Are we aligned with upstream also in terms of when (parse time vs InitPlan time) the locks are acquired? I think this PR also helps mitigate cases where the actual function execution (deleting the rows of a concurrently vacuumed table) is much delayed after taking the snapshot: SELECT pg_sleep(10) FROM dummy_table UNION ALL SELECT delete_rows();
Is this the case where we get the snapshot, and after ten seconds we delete rows? In that SQL, do you know when we take locks in the function delete_rows()?
@@ -0,0 +1,42 @@
src/backend/storage/lmgr/README-partition
Thank you for starting this README. Is there a way we can represent the information in a tabular format, capturing the different operations (I/U/D/DDLs), GDD enabled, GDD disabled, with ORCA, without ORCA? Clearly flag in the table any deviations from upstream as well.
@asimrp would you be interested in working with @bbbearxyz to help get it into that shape? I think you were also wondering about it with me.
ok, I will do it
Hi Ashwin, can you tell me clearly where we should add the README file?
…ition ao table

The problem is that the error "tuple concurrently updated" occurs when running a DELETE query during ETL, when we concurrently execute an UPDATE/DELETE statement and a VACUUM statement on a partitioned AO table. The UPDATE/DELETE statement is likely to get an out-of-date snapshot after the VACUUM has been processed on one of the leaf partitions. So we fix it by acquiring locks before executing; after we take the locks, we dispatch the snapshot, so this problem cannot happen.

If the relation is partitioned, we should acquire the corresponding locks on each leaf partition before entering InitPlan. The snapshot is acquired before InitPlan, so if we wait for locks inside InitPlan, the snapshot may have become invalid by the time we hold all of them. Hence we must acquire the locks before entering InitPlan to keep the snapshot valid.

In the parse-analyze stage, we should take all the locks we need, and also keep the locks of the parent table and the child tables consistent. Only one situation is special: when the command is INSERT and GDD is off, a deadlock can happen. For example, without locking the partition relations on the QD when INSERTing with Planner, the following deadlock scenario may happen between INSERT and the AppendOnly VACUUM drop phase on the partition table:

1. AO VACUUM drop on QD: acquired AccessExclusiveLock
2. INSERT on QE: acquired RowExclusiveLock
3. AO VACUUM drop on QE: waiting for AccessExclusiveLock
4. INSERT on QD: waiting for AccessShareLock at ExecutorEnd()

2 blocks 3, 1 blocks 4, and 1 and 2 will not release their locks until 3 and 4 proceed. Hence INSERT needs to lock the partition tables on the QD (before step 2) to prevent this deadlock. If GDD is enabled, it will resolve the deadlock, so it is safe not to take locks for the partitions. But for UPDATE and DELETE, we must acquire locks whether GDD is on or not.

So for normal SQL statements, we move the lock-acquiring behavior from InitPlan to setTargetTable; for cached plans, we add code to acquire locks on the child tables.

Co-authored-by: Rui Wang wangru@vmware.com