
quickdie may be invoked on rxThread #11006

Closed
hidva opened this issue Oct 20, 2020 · 5 comments

@hidva (Contributor) commented Oct 20, 2020

Greenplum version or build

PostgreSQL 12beta2 (Greenplum Database 7.0.0-alpha.0+dev.13977.ge8521fbc73 build dev) on x86_64-pc-linux-gnu, compiled by gcc (GCC) 6.5.1 20190307 (Alibaba 6.5.1-1 2.17), 64-bit compiled on Oct 20 2020 10:56:27 (with assert checking)

OS version and uname -a

Linux gpdb 4.9.151-015.ali3000.alios7.x86_64 #1 SMP Tue Mar 12 19:10:26 CST 2019 x86_64 x86_64 x86_64 GNU/Linux

autoconf options used (config.status --config)

'--prefix=/u01/user/zhanyi/DBBIN/DB' '--enable-gpfdist' '--disable-pxf' '--disable-orca' '--enable-gdb' '--enable-debug' '--enable-depend' '--without-zstd' '--enable-cassert' '--disable-gpcloud' 'CFLAGS=-O0 -ggdb -fsanitize=address' 'CXXFLAGS=-O0 -ggdb -fsanitize=address' 'LIBS=-ldl'

Installation information (pg_config)

$ pg_config
BINDIR = /u01/user/zhanyi/DBBIN/DB/bin
DOCDIR = /u01/user/zhanyi/DBBIN/DB/share/doc/postgresql
HTMLDIR = /u01/user/zhanyi/DBBIN/DB/share/doc/postgresql
INCLUDEDIR = /u01/user/zhanyi/DBBIN/DB/include
PKGINCLUDEDIR = /u01/user/zhanyi/DBBIN/DB/include/postgresql
INCLUDEDIR-SERVER = /u01/user/zhanyi/DBBIN/DB/include/postgresql/server
LIBDIR = /u01/user/zhanyi/DBBIN/DB/lib
PKGLIBDIR = /u01/user/zhanyi/DBBIN/DB/lib/postgresql
LOCALEDIR = /u01/user/zhanyi/DBBIN/DB/share/locale
MANDIR = /u01/user/zhanyi/DBBIN/DB/share/man
SHAREDIR = /u01/user/zhanyi/DBBIN/DB/share/postgresql
SYSCONFDIR = /u01/user/zhanyi/DBBIN/DB/etc/postgresql
PGXS = /u01/user/zhanyi/DBBIN/DB/lib/postgresql/pgxs/src/makefiles/pgxs.mk
CONFIGURE = '--prefix=/u01/user/zhanyi/DBBIN/DB' '--enable-gpfdist' '--disable-pxf' '--disable-orca' '--enable-gdb' '--enable-debug' '--enable-depend' '--without-zstd' '--enable-cassert' '--disable-gpcloud' 'CFLAGS=-O0 -ggdb -fsanitize=address' 'CXXFLAGS=-O0 -ggdb -fsanitize=address' 'LIBS=-ldl'
CC = gcc
CPPFLAGS = -D_GNU_SOURCE
CFLAGS = -Wall -Wmissing-prototypes -Wpointer-arith -Werror=vla -Wendif-labels -Wmissing-format-attribute -Wformat-security -fno-strict-aliasing -fwrapv -fexcess-precision=standard -Wno-unused-but-set-variable -g -ggdb -O0 -ggdb -fsanitize=address  -Werror=uninitialized -Werror=implicit-function-declaration
CFLAGS_SL = -fPIC
LDFLAGS = -Wl,--as-needed -Wl,-rpath,'/u01/user/zhanyi/DBBIN/DB/lib',--enable-new-dtags
LDFLAGS_EX =
LDFLAGS_SL =
LIBS = -lpgcommon -lpgport -lnuma -lbz2 -lrt -lz -lreadline -lrt -lcrypt -lm -ldl -lcurl
VERSION = PostgreSQL 12beta2

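Actual behavior

The SIGQUIT handler quickdie() (postgres_signal_arg=3) runs on the interconnect rxThread (frames #39 to #46) and fails an assertion in palloc() while building its error message: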

#33 0x00000000016dc12d in EmitErrorReport () at elog.c:1669
#34 0x00000000016d65d1 in errfinish (filename=0x1db3720 "assert.c", lineno=48, funcname=0x1db37e0 <__func__.7422> "ExceptionalCondition") at elog.c:614
#35 0x00000000016d36fa in ExceptionalCondition (conditionName=0x1e0eae0 "!(((context) != ((void *)0) && (((((const Node*)((context)))->type) == T_AllocSetContext) || ((((const Node*)((context)))->type) == T_SlabContext) || ((((const Node*)((context)))->type) == T_GenerationContext))))", errorType=0x1e0eaa0 "BadArgument", fileName=0x1e0ebe0 "mcxt.c", lineNumber=1143) at assert.c:44
#36 0x0000000001784fc6 in palloc (size=1024) at mcxt.c:1143
#37 0x00000000019548b9 in initStringInfo (str=0x7f1dc95df300) at stringinfo.c:63
#38 0x00000000016d7d6f in errmsg (fmt=0x1d06ce0 "terminating connection because of crash of another server process") at elog.c:1061
#39 0x00000000012f0474 in quickdie (postgres_signal_arg=3) at postgres.c:3467
#40 <signal handler called>
#41 0x00007f1dcf7f827d in poll () from /lib64/libc.so.6
#42 0x00007f1dd14eef64 in poll () from /lib64/libasan.so.3
#43 0x000000000185f096 in testmode_poll (caller_name=0x1e9be80 <__func__.33279> "rxThreadFunc", fds=0x7f1dc95dfb90, nfds=1, timeout=250) at ../../../../src/include/cdb/cdbicudpfaultinjection.h:359
#44 0x000000000187c16f in rxThreadFunc (arg=0x0) at ic_udpifc.c:6183
#45 0x00007f1dcfff3e25 in start_thread () from /lib64/libpthread.so.0
#46 0x00007f1dcf802f1d in clone () from /lib64/libc.so.6

Steps to reproduce the behavior

createdb autoanalyzetest1;
createdb autoanalyzetest2;
createdb autoanalyzetest3;
createdb autoanalyzetest4;
createdb autoanalyzetest5;
createdb autoanalyzetest6;

pgbench -i -n -s 1 autoanalyzetest1;
pgbench -i -n -s 1 autoanalyzetest2;
pgbench -i -n -s 1 autoanalyzetest3;
pgbench -i -n -s 1 autoanalyzetest4;
pgbench -i -n -s 1 autoanalyzetest5;
pgbench -i -n -s 1 autoanalyzetest6;

cwd=$(pwd)
for db in autoanalyzetest1 autoanalyzetest2 autoanalyzetest3 autoanalyzetest4 autoanalyzetest5 autoanalyzetest6
do
  for protocol in simple extended prepared
  do
    outputdir=${cwd}/${db}-${protocol}
    mkdir -pv ${outputdir}
    cd ${outputdir}
    nohup pgbench --protocol=${protocol} -c 1 -j 1 -l -n -T 604800 --aggregate-interval=600 ${db} 1>pgbench.stdout.log 2>&1 &
    cd ${cwd}
  done
done

On an 8-core, 32 GB (8C32G) machine, this reproduces the error every time. On a more powerful machine, increase the number of clients.

This error can be fixed with a patch like the following:

diff --git a/src/backend/cdb/motion/ic_udpifc.c b/src/backend/cdb/motion/ic_udpifc.c
index 1d93c81c06..b0794c85b8 100644
--- a/src/backend/cdb/motion/ic_udpifc.c
+++ b/src/backend/cdb/motion/ic_udpifc.c
@@ -6142,6 +6142,10 @@ handleDataPacket(MotionConn *conn, icpkthdr *pkt, struct sockaddr_storage *peer,
 static void *
 rxThreadFunc(void *arg)
 {
+       sigset_t blkset;
+       sigfillset(&blkset);
+       int pret = pthread_sigmask(SIG_SETMASK, &blkset, NULL);
+       if (pret != 0) return NULL;
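The core idea of the patch above, blocking every signal at the top of the thread body, can be sketched as a standalone program. `helper_thread_main` and `run_helper_once` are hypothetical names, not GPDB code. The sketch only checks that SIGQUIT ends up in the helper thread's mask. Like the patch, it treats a non-zero return from pthread_sigmask() as failure, because that function returns an error number instead of setting errno.

```c
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper-thread body (not GPDB code) following the idea of the
 * patch: block every signal for this thread before doing any work. The mask
 * is per-thread, so other threads of the process are unaffected.
 */
static void *
helper_thread_main(void *arg)
{
	int		   *sigquit_blocked = arg;
	sigset_t	blkset;
	sigset_t	cur;
	int			err;

	sigfillset(&blkset);
	err = pthread_sigmask(SIG_SETMASK, &blkset, NULL);
	if (err != 0)
	{
		/* pthread_sigmask() returns an error number instead of setting errno. */
		fprintf(stderr, "pthread_sigmask: %s\n", strerror(err));
		*sigquit_blocked = -1;
		return NULL;
	}

	/* Passing a NULL set only reads the calling thread's current mask. */
	pthread_sigmask(SIG_SETMASK, NULL, &cur);
	*sigquit_blocked = sigismember(&cur, SIGQUIT);

	/* ... the thread's real work (e.g. a poll() loop) would follow here ... */
	return NULL;
}

/* Run the helper once and report whether SIGQUIT was blocked inside it. */
static int
run_helper_once(void)
{
	pthread_t	tid;
	int			sigquit_blocked = 0;

	if (pthread_create(&tid, NULL, helper_thread_main, &sigquit_blocked) != 0)
		return -1;
	pthread_join(tid, NULL);
	return sigquit_blocked;
}
```

Calling run_helper_once() should return 1. SIGKILL and SIGSTOP cannot be blocked; POSIX requires pthread_sigmask() to ignore them silently when they are in the set, so sigfillset() is safe to use here.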

@paul-guo- (Member) commented Oct 20, 2020

This should be changed in ic_set_pthread_sigmasks() instead. I'm curious why we did not see this issue before, and why we previously masked only some of the signals in the pthreads.

@hidva (Contributor Author) commented Oct 20, 2020

Then we should add SIGQUIT to ic_set_pthread_sigmasks().

> why previously we did not see this issue

In theory, this issue has always existed, but it is difficult to trigger.
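Why this is possible at all: POSIX delivers a process-directed signal (for example one sent with kill() to a backend's PID) to exactly one thread that has not blocked it, without specifying which. If the rxThread leaves SIGQUIT unblocked, it is a valid target even if the main thread usually receives the signal. The sketch below shows the converse under simple assumptions: once a helper thread blocks every signal, the calling thread is the only candidate, so the handler must run there. SIGUSR1 stands in for SIGQUIT so the demo does not dump core. All names are hypothetical, and reading a _Thread_local variable inside the handler is a demo shortcut, not something POSIX lists as async-signal-safe.

```c
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>
#include <signal.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

/* 1 only in the thread that calls deliver_usr1_to_process(). */
static _Thread_local int is_caller_thread;
/* Set by the handler: 1 = ran on the caller thread, 0 = ran elsewhere. */
static volatile sig_atomic_t ran_on_caller = -1;

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
static int	helper_ready;
static int	helper_stop;

static void
record_thread(int signo)
{
	(void) signo;
	/* Demo only: reads a TLS variable of the main executable. */
	ran_on_caller = is_caller_thread;
}

/* Helper that blocks every signal, then idles until told to stop. */
static void *
masked_helper(void *arg)
{
	sigset_t	all;

	(void) arg;
	sigfillset(&all);
	pthread_sigmask(SIG_SETMASK, &all, NULL);

	pthread_mutex_lock(&lock);
	helper_ready = 1;
	pthread_cond_broadcast(&cond);
	while (!helper_stop)
		pthread_cond_wait(&cond, &lock);
	pthread_mutex_unlock(&lock);
	return NULL;
}

/* Returns 1 if a process-directed SIGUSR1 was handled on the calling thread. */
static int
deliver_usr1_to_process(void)
{
	struct sigaction sa;
	sigset_t	usr1;
	pthread_t	tid;
	struct timespec ts = {0, 1000000};	/* 1 ms */
	int			tries;

	is_caller_thread = 1;
	memset(&sa, 0, sizeof(sa));
	sa.sa_handler = record_thread;
	sigemptyset(&sa.sa_mask);
	sigaction(SIGUSR1, &sa, NULL);

	/* The calling thread must be able to take SIGUSR1. */
	sigemptyset(&usr1);
	sigaddset(&usr1, SIGUSR1);
	pthread_sigmask(SIG_UNBLOCK, &usr1, NULL);

	if (pthread_create(&tid, NULL, masked_helper, NULL) != 0)
		return -1;
	/* Wait until the helper has installed its all-blocking mask. */
	pthread_mutex_lock(&lock);
	while (!helper_ready)
		pthread_cond_wait(&cond, &lock);
	pthread_mutex_unlock(&lock);

	/* kill() targets the whole process, not a particular thread. */
	kill(getpid(), SIGUSR1);
	for (tries = 0; ran_on_caller == -1 && tries < 1000; tries++)
		nanosleep(&ts, NULL);

	pthread_mutex_lock(&lock);
	helper_stop = 1;
	pthread_cond_broadcast(&cond);
	pthread_mutex_unlock(&lock);
	pthread_join(tid, NULL);

	return ran_on_caller;
}
```

deliver_usr1_to_process() should return 1. If masked_helper() skipped its pthread_sigmask() call, the system would be free to run the handler on the helper instead, which mirrors how quickdie() ended up on the rxThread.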

> why previously we just mask part of signals in the pthreads?

I don't know either.

@paul-guo- (Member) commented Oct 20, 2020

> Then we should add SIGQUIT to ic_set_pthread_sigmasks()

If we mask all signals, we should also do that in ic_set_pthread_sigmasks() (perhaps renaming the function) to avoid a potential race: a signal could be delivered to rxThreadFunc before it calls pthread_sigmask().
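One common way to close that race relies on the POSIX rule that a new thread inherits the signal mask of the thread that calls pthread_create(): block everything in the creator, create the thread, then restore the creator's mask. Whether ic_set_pthread_sigmasks() can be restructured this way is an assumption, since its body is not shown in this thread. `create_thread_all_signals_blocked` and `report_sigquit_mask` are hypothetical names.

```c
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>
#include <signal.h>

/*
 * Hypothetical wrapper: create a thread whose signal mask blocks everything
 * from its very first instruction. The new thread inherits the creator's
 * mask, so there is no window between thread start and a pthread_sigmask()
 * call inside the thread body.
 */
static int
create_thread_all_signals_blocked(pthread_t *tid, void *(*fn) (void *), void *arg)
{
	sigset_t	all;
	sigset_t	saved;
	int			err;

	sigfillset(&all);
	err = pthread_sigmask(SIG_SETMASK, &all, &saved);
	if (err != 0)
		return err;

	err = pthread_create(tid, NULL, fn, arg);

	/* Restore the creator's original mask whether or not creation succeeded. */
	(void) pthread_sigmask(SIG_SETMASK, &saved, NULL);
	return err;
}

/* Example thread body: record whether SIGQUIT is already blocked on entry. */
static void *
report_sigquit_mask(void *arg)
{
	sigset_t	cur;

	pthread_sigmask(SIG_SETMASK, NULL, &cur);
	*(int *) arg = sigismember(&cur, SIGQUIT);
	return NULL;
}
```

Create the thread through the wrapper and join it as usual; report_sigquit_mask() stores 1 because the mask was in place before the thread ran any code. The creator only has its signals blocked for the duration of pthread_create(), and any signal that arrives in that window stays pending until the old mask is restored.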

@paul-guo- (Member) commented Oct 21, 2020

It seems that we should mask all signals in the UDP pthreads, and that it is safe to do so. Please feel free to open a PR.

@hidva (Contributor Author) commented Oct 21, 2020

hidva added a commit to hidva/gpdb that referenced this issue Oct 21, 2020
In some cases, signals (like SIGQUIT) that should only be
processed by the main thread of the postmaster may be dispatched to rxThread.
So we should block all signals in the UDP pthreads, and it is safe to do so.

Fix greenplum-db#11006
paul-guo- added a commit that referenced this issue Oct 28, 2020

Fix #11006
paul-guo- added a commit to paul-guo-/gpdb that referenced this issue Oct 28, 2020

Fix greenplum-db#11006

(cherry picked from commit 54451fc)
paul-guo- added a commit that referenced this issue Nov 4, 2020

Fix #11006

(cherry picked from commit 54451fc)

3 participants