Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix writeable CTE assign reader gang to writer slice. #11058

Merged
merged 1 commit into from Nov 26, 2020

Conversation

JunfengYang
Copy link
Contributor

For the gang allocation, we assert the first allocated gang always be the
writer gang. And when recycle, the writer QE aliways be putted at the
top of free list.

Normally, the writer slice will be assign writer gang first when iterate the
slice table. But this is not true for writable CTE (with only one writer gang).
For below statement:

explain (slicetable) WITH updated AS (update tbl set rank = 6 where id = 5 returning
rank) select * from tbl where rank in (select rank from updated);
                        QUERY PLAN
--------------------------------------------------------------
 Gather Motion 3:1  (slice1; segments: 3)
   ->  Seq Scan on tbl
         Filter: (hashed SubPlan 1)
         SubPlan 1
           ->  Broadcast Motion 1:3  (slice2; segments: 1)
                 ->  Update on tbl
                       ->  Seq Scan on tbl
                             Filter: (id = 5)
 Slice 0: Dispatcher; root 0; parent -1; gang size 0
 Slice 1: Reader; root 0; parent 0; gang size 3
 Slice 2: Primary Writer; root 0; parent 1; gang size 1

If we sill assigns writer gang to Slice 1 here, the writer process will execute
on reader gang. So, find the writer slice and assign writer gang first.

This fixes the issue #11024

Here are some reminders before you submit the pull request

  • Add tests for the change
  • Document changes
  • Communicate in the mailing list if needed
  • Pass make installcheck
  • Review a PR in return to support the community

@asimrp
Copy link
Contributor

asimrp commented Nov 11, 2020

In spite of the detailed description, I couldn't understand the problem until stumbling upon cdbcomponent_allocateIdleQE(). Therein lies the root of the problem:

        /*                                                                                                                                                                                                                                    
         * 1. for entrydb, it's never be writer.                                                                                                                                                                                              
         * 2. for first QE, it must be a writer.                                                                                                                                                                                              
         */
        isWriter = contentId == -1 ? false: (cdbinfo->numIdleQEs == 0 && cdbinfo->numActiveQEs == 0);
        segdbDesc = cdbconn_createSegmentDescriptor(cdbinfo, nextQEIdentifer(cdbinfo->cdbs), isWriter);

It is strange that the function takes into account the segmentType argument only when allocating an idle QE from freelist but ignores it when creating a new QE when freelist is not sufficient. The segmentType parameter indicates whether the QE needs to be a reader or a writer. There is opportunity to utilize this parameter more effectively rather than working around the root of the problem and further diminishing readability of this part of the code.

@JunfengYang
Copy link
Contributor Author

It is strange that the function takes into account the segmentType argument only when allocating an idle QE from freelist but ignores it when creating a new QE when freelist is not sufficient. The segmentType parameter indicates whether the QE needs to be a reader or a writer. There is opportunity to utilize this parameter more effectively rather than working around the root of the problem and further diminishing readability of this part of the code.

Yes, but we have an assumption that the first QE on the free list is the writer. And Pengzhou advised not to break the assumption before. As the workaround is very simple, so I just go with the workaround here.
Let me dig deeper.

@JunfengYang
Copy link
Contributor Author

@asimrp Actually, as I dig into our implementation, the writer gang must be allocated first. For a reader QE, it'll find the writer QE proc and set the writer QE proc as lockHolderProcPtr.

if (lockmethodid == DEFAULT_LOCKMETHOD && locktag->locktag_type != LOCKTAG_TRANSACTION)
{
if (IS_QUERY_EXECUTOR_BACKEND() && !Gp_is_writer)
{
if (lockHolderProcPtr == MyProc)
{
/* Find the guy who should manage our locks */
PGPROC * proc = FindProcByGpSessionId(gp_session_id);
int count = 0;
while(proc==NULL && count < 5)
{
pg_usleep( /* microseconds */ 2000);
count++;
CHECK_FOR_INTERRUPTS();
proc = FindProcByGpSessionId(gp_session_id);
}
if (proc != NULL)
{
elog(DEBUG1,"Found writer proc entry. My Pid %d, his pid %d", MyProc-> pid, proc->pid);
lockHolderProcPtr = proc;
}
else
ereport(FATAL,
(errcode(ERRCODE_GP_INTERCONNECTION_ERROR),
errmsg(WRITER_IS_MISSING_MSG),
errdetail("lock [%u,%u] %s %d. "
"Probably because writer gang is gone somehow. "
"Maybe try rerunning.", locktag->locktag_field1,
locktag->locktag_field2, lock_mode_names[lockmode],
(int)locktag->locktag_type)));
}
}

If reader QE allocated first, the libpq connection will even fail. Since InitializeSessionUserId will try to open pg_authid with AccessShareLock.

Copy link
Contributor

@asimrp asimrp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given the current interfaces between executor and dispatcher, I'm convinced that this is the best possible alternative. Thank you Junfeng for your patience!

For the gang allocation, our current implementation required the first
allocated gang must be the writer gang.
This has several reasons:
- For lock holding, Because of our MPP structure, we assign a LockHolder
  for each segment when executing a query. lockHolder is the gang member that
  should hold and manage locks for this transaction. On the QEs, it should
  normally be the Writer gang member. More details please refer to
  lockHolderProcPtr in lock.c.
- For SharedSnapshot among session's gang processes on a particular segment.
  During initPostgres(), reader QE will try to lookup the shared slot written
  by writer QE. More details please reger to sharedsnapshot.c.

Normally, the writer slice will be assign writer gang first when iterate the
slice table. But this is not true for writable CTE (with only one writer gang).
For below statement:

explain (slicetable) WITH updated AS (update tbl set rank = 6 where id = 5 returning
rank) select * from tbl where rank in (select rank from updated);
                        QUERY PLAN
--------------------------------------------------------------
 Gather Motion 3:1  (slice1; segments: 3)
   ->  Seq Scan on tbl
         Filter: (hashed SubPlan 1)
         SubPlan 1
           ->  Broadcast Motion 1:3  (slice2; segments: 1)
                 ->  Update on tbl
                       ->  Seq Scan on tbl
                             Filter: (id = 5)
 Slice 0: Dispatcher; root 0; parent -1; gang size 0
 Slice 1: Reader; root 0; parent 0; gang size 3
 Slice 2: Primary Writer; root 0; parent 1; gang size 1

If we sill assign writer gang to Slice 1 here, the writer process will execute
on reader gang. So, find the writer slice and assign writer gang first.
@JunfengYang JunfengYang merged commit 043b809 into greenplum-db:master Nov 26, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants