
[6X FIX] Fix a bug of concurrently executing UPDATE/DELETE with VACUUM on a partitioned AO table #11895

Merged
merged 1 commit on Jul 2, 2021

Conversation

bbbearxyz

The problem is that the error "tuple concurrently updated" occurs when running
a DELETE query during ETL. When we concurrently execute an UPDATE/DELETE
statement and a VACUUM statement on a partitioned AO table, the UPDATE/DELETE
statement is likely to get an out-of-date snapshot after the VACUUM has been
processed on one of the leaf partitions.

Here are some reminders before you submit the pull request

  • Add tests for the change
  • Document changes
  • Communicate in the mailing list if needed
  • Pass make installcheck
  • Review a PR in return to support the community

Contributor

@asimrp asimrp left a comment

There are no AO tables in upstream PostgreSQL, but there are partitioned tables. How does the locking work for partitioned tables in upstream? It will be a good reference point in order to evaluate this change.

@kainwen
Contributor

kainwen commented May 20, 2021

There are no AO tables in upstream PostgreSQL, but there are partitioned tables. How does the locking work for partitioned tables in upstream? It will be a good reference point in order to evaluate this change.

Hi @asimrp

For UPDATE/DELETE on the root table, upstream will lock all the leaves.

For INSERT, please refer to this thread for upstream's behavior: https://groups.google.com/a/greenplum.org/g/gpdb-dev/c/wAPKpJzhbpM/m/VEkOeMf_BQAJ
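
For reference, a rough sketch of the upstream approach described above (loosely modeled on PostgreSQL's expand_inherited_rtentry(); the exact code differs by version, and is_target/parentOID are illustrative names, not variables from this patch):

    /*
     * Hedged sketch, not the 6X code path: when the partitioned table is the
     * target of an UPDATE/DELETE, upstream locks every descendant with the
     * same lockmode as the parent; otherwise a weaker lock suffices.
     */
    LOCKMODE lockmode = is_target ? RowExclusiveLock : AccessShareLock;
    List    *inhOIDs = find_all_inheritors(parentOID, lockmode, NULL);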


pstate->p_target_relation = parserOpenTable(pstate, relation, RowExclusiveLock, NULL);
Oid relid = RangeVarGetRelid(relation, NoLock, true);
Contributor

We can get the relid from pstate->p_target_relation; there is no need to look it up again from the RangeVar.
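
A minimal sketch of that suggestion (the relation is already open, so its OID can be read directly; RelationGetRelid() is the standard accessor, equivalent to the rd_id field used later in this thread):

    pstate->p_target_relation = parserOpenTable(pstate, relation, RowExclusiveLock, NULL);
    /* no second catalog lookup via RangeVarGetRelid() needed */
    Oid relid = RelationGetRelid(pstate->p_target_relation);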

if (parsetree->commandType == CMD_INSERT)
    lockmode = RowExclusiveLock;
else
    lockmode = ExclusiveLock;
Contributor

I think the lockmode should also be deduced here.

Contributor

I think the lockmode should also be deduced here.

Yeah. We can just use lockmode here. That's easier.
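
In other words, a sketch of the suggestion (not the final code): pass the lockmode deduced above straight through when locking the children.

    if (parsetree->commandType == CMD_INSERT)
        lockmode = RowExclusiveLock;
    else
        lockmode = ExclusiveLock;
    ...
    /* reuse the deduced lockmode rather than hard-coding one here */
    find_all_inheritors(rte->relid, lockmode, NULL);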

@asimrp
Contributor

asimrp commented May 25, 2021

It seems there was consensus to adopt the locking strategy from upstream (lock all leaves): https://groups.google.com/a/greenplum.org/g/gpdb-dev/c/-FELESD8BCc/m/W-wWpb7UAgAJ

Is it possible to do the same on 6X_STABLE? Or such a change would be too intrusive for a stable branch?

@kainwen
Contributor

kainwen commented May 25, 2021

It seems there was consensus to adopt the locking strategy from upstream (lock all leaves): https://groups.google.com/a/greenplum.org/g/gpdb-dev/c/-FELESD8BCc/m/W-wWpb7UAgAJ

Is it possible to do the same on 6X_STABLE? Or such a change would be too intrusive for a stable branch?

Hi @asimrp, 6X's current behavior is not correct, which is what leads to this bug.

Without GDD, 6X locks all the leaves for INSERT, DELETE, and UPDATE; that is the same behavior as this commit. However, the previous fix forgot to modify the related code for cached plans.

With GDD, 6X's behavior is very strange:

  • with ORCA, the leaves will not be locked on the QD
  • with the planner, the leaves are locked with the wrong lock mode for an AO partitioned table

This commit keeps the same behavior as upstream (which is also the same as 6X without GDD, but done in a correct way).

The only thing that might need discussion: with GDD, do we still need to lock all the leaves on the QD? I have tried to find a bad case for not locking them, but no result so far.

cc @d

lockmode = ExclusiveLock;
else
lockmode = RowExclusiveLock;
if (pstate->p_is_insert && gp_enable_global_deadlock_detector) {
Contributor

@kainwen kainwen May 27, 2021

It would be better to add more comments here, e.g. move the comments (removed by this PR) from InitPlan() to here.

*/
if (rel_is_partitioned(rte->relid) &&
    (parsetree->commandType == CMD_INSERT ||
     parsetree->commandType == CMD_UPDATE ||
     parsetree->commandType == CMD_DELETE))
    find_all_inheritors(rte->relid, lockmode, NULL);
Contributor

Here the behavior is not the same as in setTargetTable; we should keep them consistent.

Contributor

Here the behavior is not the same as in setTargetTable; we should keep them consistent.

Do you mean adding the GDD conditional, removing the commandType check, or both?

@bbbearxyz bbbearxyz force-pushed the backup branch 2 times, most recently from d2035ab to 48602e9 Compare May 28, 2021 06:47
Contributor

@JunfengYang JunfengYang left a comment

Please note the usage of the function rel_is_partitioned. To recognize a partitioned table, see this example:

if (rel_is_partitioned(heapOid))
{
    PartitionNode *pn;

    pn = get_parts(heapOid, 0 /* level */, 0 /* parent */, false /* inctemplate */,
                   true /* includesubparts */);
    prels = all_partition_relids(pn);
}
else if (rel_is_child_partition(heapOid))
    prels = find_all_inheritors(heapOid, NoLock, NULL);


pstate->p_target_relation = parserOpenTable(pstate, relation, RowExclusiveLock, NULL);
Oid relid = pstate->p_target_relation->rd_id;
if (rel_is_partitioned(relid))
Contributor

rel_is_partitioned only recognizes the top relation of a partitioned table. If the relation is a middle-layer partitioned table, it will return false and we will skip locking its children.
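
A hedged sketch of the combined check that the example above implies (assuming find_all_inheritors() is acceptable for both cases; the quoted example uses get_parts()/all_partition_relids() for the root instead):

    /* Sketch only: cover both the root partition and a mid-level child,
     * since rel_is_partitioned() is true only for the root. */
    if (rel_is_partitioned(relid) || rel_is_child_partition(relid))
        find_all_inheritors(relid, lockmode, NULL);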

* If the relation is partitioned, we should acquire corresponding locks
* of each leaf partition before entering InitPlan.
*/
if (rel_is_partitioned(rte->relid) && ((parsetree->commandType == CMD_INSERT
Contributor

The same rel_is_partitioned issue applies here.
Also, the indentation here looks confusing.

@@ -0,0 +1,66 @@
create extension if not exists gp_inject_fault;

create table sales_row (id int, date date, amt decimal(10,2))
Contributor

Please also add tests for the multi-level partitioned table, and operate on the middle-layer partitioned table.

Author

For a multi-level partitioned table, we can't insert/update/delete into a middle-layer table, so we don't need to add that test.

Contributor

Do we know why we can't do that? As far as I can see, upstream allows insert/update/delete on a middle-layer partitioned table.

Author

I don't know yet; I will check the reason tomorrow.

@bbbearxyz bbbearxyz force-pushed the backup branch 5 times, most recently from c617bf1 to d133653 Compare June 2, 2021 11:03
Contributor

@asimrp asimrp left a comment

Can you also explain why this problem does not occur with concurrent vacuum and DML operations on a heap partitioned table? Why is it specific to AO?

* This will cause a deadlock.
* Similar scenario apply for UPDATE as well.
* We already get all the locks in the parse-analyze stage.
* So we don't need to acquire any locks here.
Contributor

Well, this comment is wrong. Not all locks are acquired at the parse-analyze stage; see line 1742 above. The change introduced by this patch would cause locks to be acquired in InitPlan(), which is invoked during execution. Am I missing something?

Contributor

Well, this comment is wrong. Not all locks are acquired at the parse-analyze stage; see line 1742 above. The change introduced by this patch would cause locks to be acquired in InitPlan(), which is invoked during execution. Am I missing something?

Hi @asimrp

InitPlan():1742 is not the first place the relation is locked. It just opens the table to get the result relation.
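
Illustrative only (an assumption about the code path, not a quote from it): with the locks already taken at parse-analyze time, the open in InitPlan() merely produces the Relation handle, roughly:

    /* Sketch: the lock was acquired earlier, so this open does not
     * need to take a new one. */
    Relation resultRel = heap_open(resultRelationOid, NoLock);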

}
else if (pstate->p_is_insert && gp_enable_global_deadlock_detector)
{
lockmode = NoLock;
Contributor

This looks tricky. Not acquiring locks and yet accessing the tables sounds very incorrect. Can you please answer the following questions:

  1. Why is it correct to insert into a child table without acquiring a lock on it?
  2. What do we gain by not acquiring a lock?

Contributor

This looks tricky. Not acquiring locks and yet accessing the tables sounds very incorrect. Can you please answer the following questions:

  1. Why is it correct to insert into a child table without acquiring a lock on it?
  2. What do we gain by not acquiring a lock?

Hi @asimrp, just to share what is on my mind:

  1. This is the same behavior as 6X today (without this patch).
  2. An INSERT on the root partition will lock the needed leaves during execution (on the QEs).

Then for your questions:

Why is it correct to insert into a child table without acquiring a lock on it?

It will be locked on the QEs, not on the QD, which is roughly the same behavior as upstream. I have spent a lot of time trying to come up with a bad case, but it seems OK.

What do we gain by not acquiring a lock?

Maybe performance? See the thread: https://groups.google.com/a/greenplum.org/g/gpdb-dev/c/wAPKpJzhbpM/m/VEkOeMf_BQAJ

For 6X, keeping the same behavior with this patch (but implemented correctly) seems like the safest way, doesn't it?
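
Putting the pieces of this thread together, a hedged sketch of the resulting rule (child_lockmode is an illustrative name, assembled from the diff fragments quoted above, not the verbatim patch):

    LOCKMODE child_lockmode;

    if (pstate->p_is_insert && gp_enable_global_deadlock_detector)
        child_lockmode = NoLock;        /* leaves get locked on the QEs instead */
    else
        child_lockmode = lockmode;      /* same mode as the root target */

    if (child_lockmode != NoLock)
        find_all_inheritors(RelationGetRelid(pstate->p_target_relation),
                            child_lockmode, NULL);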

* following dead lock scenario may happen between INSERT and AppendOnly VACUUM
* drop phase on the partition table:
*
* 1. AO VACUUM drop on QD: acquired AccessExclusiveLock
Contributor

It is important to mention which command operates on a leaf partition and which one on the root, in this example. My understanding is, vacuum operates on a leaf, whereas insert operates on the root.

-- session 1 get locks for the sales_row_1_prt_1
-- session 2 will get snapshot and enter InitPlan, wait session 1 to release the locks for the sales_row_1_prt_1
-- session 1 complete
-- session 2 get locks but the snapshot is invalid.
Contributor

A better way to say this: "the version of tuple (in aoseg relation?) created by vacuum is invisible to session2's snapshot". The term "invalid snapshot" does not ring the right bells.

Contributor

@kainwen kainwen left a comment

LGTM. Please resolve comments ASAP.

Contributor

@asimrp asimrp left a comment

👍 to merge this fix as it targets a specific problem. Bigger questions with regard to the locking of partitioned tables with and without GDD enabled remain, but they should be dealt with separately.

@bbbearxyz @wraymo before merging, please answer the question on multilevel partitioning raised by Junfeng and also incorporate specific feedback already given.

@asimrp
Contributor

asimrp commented Jun 7, 2021

The bigger problems are being tracked in #8362.

@wraymo
Contributor

wraymo commented Jun 7, 2021

The bigger problems are being tracked in #8362.

I think we have already solved that problem in this PR, because we acquire locks on each leaf partition before InitPlan, following the lock mode of its root partition.

@ashwinstar
Contributor

I will get to reviewing this PR tomorrow. Meanwhile, let's add a README file to document the current partition locking semantics, based on the knowledge we have gained as part of this PR.

-- session 2 will get locks on the sales_row, but still wait locks on the leaf partition in the function AcquirePlannerLocks.
2&: execute test;

select pg_sleep(2);
Contributor

Why do we have this sleep in the test?

Author

We must make sure session 2 reaches the place where it needs to acquire locks. If we don't sleep, this situation may happen: session 1 waits at the vacuum_hold_lock fault, session 2 is too slow and has not reached the place where it needs to acquire locks before the fault is reset, then session 1 completes and session 2 completes serially, so the test is not valid.

Contributor

If that's the case, there is no way to guarantee it will run concurrently even with the sleep, as it is completely non-deterministic and times can vary too much in CI. Can we make it deterministic instead, by looping and checking via pg_locks that the required lock-waiting stage has been reached?

Author

@bbbearxyz bbbearxyz Jun 16, 2021

Suppose this situation: a parent table has two child tables, a and b. When we insert into the parent table, we need to acquire the parent's lock and both children's locks. In the test, the vacuum in session 1 already holds the lock on b, and we must make sure session 2 is waiting at the point where it needs to acquire the lock on b. But we can't guarantee that session 2 is waiting at exactly that point. Even though we can check pg_locks to see that session 2 already holds the lock on a, we still can't guarantee that it is waiting on the lock for b. It is non-deterministic, although the window is very small. Do you think that approach is better or not?

Author

I know how to solve it now. Thank you! I will use fault injection instead of the sleep.

Author

I injected a fault at the place where the lock needs to be acquired, but the problem we just discussed still exists: we can't guarantee that session 2 is waiting exactly on the child locks. So I use both the injected fault and the sleep to keep the non-determinism as small as possible.

@ashwinstar
Contributor

The PR states it aligns locking behavior with upstream. That's stated in the context of which locks are acquired. Are we aligned with upstream also in terms of when (parse time vs. InitPlan time) the locks are acquired?

I think this PR also helps to mitigate cases like the query below, where the actual function execution (deleting rows from a concurrently vacuumed table) is much delayed after taking the snapshot.

SELECT pg_sleep(10) FROM dummy_table UNION ALL SELECT delete_rows();

@bbbearxyz
Author

The PR states it aligns locking behavior with upstream. That's stated in the context of which locks are acquired. Are we aligned with upstream also in terms of when (parse time vs. InitPlan time) the locks are acquired?

I think this PR also helps to mitigate cases like the query below, where the actual function execution (deleting rows from a concurrently vacuumed table) is much delayed after taking the snapshot.

SELECT pg_sleep(10) FROM dummy_table UNION ALL SELECT delete_rows();

So the case is: we get the snapshot, and after ten seconds we delete rows? In that SQL, do you know when the locks are acquired inside delete_rows()?

@@ -0,0 +1,42 @@
src/backend/storage/lmgr/README-partition
Contributor

Thank you for starting this README. Is there a way we can represent the information in a tabular format, capturing the different operations (INSERT/UPDATE/DELETE/DDLs), GDD enabled vs. disabled, and with vs. without ORCA, clearly flagging in the table any deviations we have from upstream?

@asimrp would you be interested in working with @bbbearxyz to help get it into that shape? I think you were also wondering about this with me.

Author

@bbbearxyz bbbearxyz Jun 21, 2021

OK, I will do it.

@wenwangBJ

I will get to reviewing this PR tomorrow. Meanwhile, let's add a README file to document the current partition locking semantics, based on the knowledge we have gained as part of this PR.

Hi Ashwin, could you clarify where we should add the README file?

@bbbearxyz bbbearxyz force-pushed the backup branch 2 times, most recently from 7edefd7 to fe94298 Compare June 29, 2021 08:52
Fix a bug of concurrently executing UPDATE/DELETE with VACUUM on a partitioned AO table

The problem is that the error "tuple concurrently updated" occurs when
running a DELETE query during ETL. When we concurrently execute an
UPDATE/DELETE statement and a VACUUM statement on a partitioned AO table,
the UPDATE/DELETE statement is likely to get an out-of-date snapshot after
the VACUUM has been processed on one of the leaf partitions.
We fix it by acquiring the locks before executing: once we hold the locks,
we dispatch the snapshot, so the problem can no longer happen.

If the relation is partitioned, we should acquire the corresponding locks
on each leaf partition before entering InitPlan. The snapshot is acquired
before InitPlan, so if we wait for the locks for a while inside InitPlan,
the snapshot may have become invalid by the time all the locks are granted.
Therefore we must acquire the locks before entering InitPlan to keep the
snapshot valid.

In the parse-analyze stage we should take all the locks we need, and also
keep the lock modes of the parent table and the child tables consistent.

Only one situation is special: when the command is an INSERT and GDD is
not enabled, a deadlock can happen. For example, without locking the
partition relations on the QD when inserting with the Planner, the
following deadlock scenario may happen between INSERT and the AppendOnly
VACUUM drop phase on the partitioned table:

1. AO VACUUM drop on QD: acquired AccessExclusiveLock
2. INSERT on QE: acquired RowExclusiveLock
3. AO VACUUM drop on QE: waiting for AccessExclusiveLock
4. INSERT on QD: waiting for AccessShareLock at ExecutorEnd()

2 blocks 3, 1 blocks 4, and 1 and 2 will not release their locks until 3
and 4 proceed. Hence INSERT needs to lock the partition tables on the QD
(before step 2) to prevent this deadlock.

If GDD is enabled, however, GDD will resolve such a deadlock, so it is
safe not to take the partition locks on the QD for INSERT in that case.
But for UPDATE and DELETE we must acquire the locks whether GDD is
enabled or not.

So for normal SQL statements we move the lock acquisition from InitPlan
to setTargetTable, and for cached plans we add code to acquire the locks
on the child tables.
Co-authored-by: Rui Wang <wangru@vmware.com> (wraymo)