Re-implement Doublewrite buffer encryption #3

Open: wants to merge 21 commits into base branch ps-8.0.20-merge

Conversation

satya-bodapati (Collaborator)

No description provided.

percona-ysorokin and others added 20 commits May 5, 2020 22:38
…ace encryption)

https://jira.percona.com/browse/PS-6789

Temporarily reverted PS-3822 "InnoDB system tablespace encryption"
https://jira.percona.com/browse/PS-3822
(commit 78b6114)
to make the parallel doublewrite part of the upstream 8.0.20 merge easier.

Temporarily disabled the following MTR test cases:
- 'innodb.percona_parallel_dblwr_encrypt'
- 'innodb.percona_sys_tablespace_encrypt'
- 'innodb.percona_sys_tablespace_encrypt_dblwr'
- 'sys_vars.innodb_parallel_dblwr_encrypt_basic'
- 'sys_vars.innodb_sys_tablespace_encrypt_basic'
…b_doublewrite file when innodb_doublewrite is disabled)

https://jira.percona.com/browse/PS-6789

Temporarily reverted PS-3411 "LP #1570682: Parallel doublewrite buffer file created when skip-innodb_doublewrite is set"
https://jira.percona.com/browse/PS-3411
(commit 14318e4)
to make the parallel doublewrite part of the upstream 8.0.20 merge easier.
…must crash server on I/O error)

https://jira.percona.com/browse/PS-6789

Temporarily reverted PS-5678 "Parallel doublewrite must crash server on I/O error"
https://jira.percona.com/browse/PS-5678
(commit 0f810d7)
to make the parallel doublewrite part of the upstream 8.0.20 merge easier.
…rotation. ALPHA)

https://jira.percona.com/browse/PS-6789

Temporarily reverted 'buf0dblwr.cc' part of the PS-3829 "Innodb key rotation. ALPHA"
https://jira.percona.com/browse/PS-3829
(commit c7f44ee)
to make the parallel doublewrite part of the upstream 8.0.20 merge easier.
…d to set O_DIRECT on xb_doublewrite when running MTR test cases)

https://jira.percona.com/browse/PS-6789

Temporarily reverted PS-1068 "Fix bug 1669414 (Failed to set O_DIRECT on xb_doublewrite when running MTR test cases)"
https://jira.percona.com/browse/PS-1068
(commit 7f41824)
to make the parallel doublewrite part of the upstream 8.0.20 merge easier.
…lel doublewrite memory not freed with innodb_fast_shutdown=2)

https://jira.percona.com/browse/PS-6789

Temporarily reverted PS-1707 "LP #1578139: Parallel doublewrite memory not freed with innodb_fast_shutdown=2"
https://jira.percona.com/browse/PS-1707
(commit 8a53ed7)
to make the parallel doublewrite part of the upstream 8.0.20 merge easier.
… implementation (Implement parallel doublewrite)

https://jira.percona.com/browse/PS-6789

Reverted 'parallel-doublewrite' blueprint implementation "Implement parallel doublewrite"
https://blueprints.launchpad.net/percona-server/+spec/parallel-doublewrite
(commit 4596aaa)
to make the parallel doublewrite part of the upstream 8.0.20 merge easier.

Temporarily disabled the following MTR test cases:
- 'sys_vars.innodb_parallel_doublewrite_path_basic'
- 'innodb.percona_doublewrite'
https://jira.percona.com/browse/PS-6789

***
Updated man pages from MySQL Server 8.0.20 source tarball.

***
Updated 'scripts/fill_help_tables.sql' from MySQL Server 8.0.20 source
tarball.
https://jira.percona.com/browse/PS-6789

***
Reverted our fix for PS-6094
"Handler fails to trigger on Error 1049 or SQLSTATE 42000 or plain sqlexception"
(https://jira.percona.com/browse/PS-6094)
(commit 31b5c73)
in favor of the upstream fix for Bug #30561920 / #97682
"Handler fails to trigger on Error 1049 or SQLSTATE 42000 or plain sqlexception"
(https://bugs.mysql.com/bug.php?id=97682)
(commit mysql/mysql-server@72c6171).

***
Reverted our fix for PS-3630
"LP #1660255: Test innodb.innodb_mysql is unstable"
(https://jira.percona.com/browse/PS-3630)
(commit e0b5050)
in favor of the upstream fix for Bug #30810572
"FIX INNODB-MYSQL TEST"
(commit mysql/mysql-server@2692669).

***
Reverted our 8.0.17 merge postfix
"PS-5363 (Merge MySQL 8.0.17): fixed regexps in the rpl.rpl_perfschema_threads_processlist_status MTR test case"
(https://jira.percona.com/browse/PS-5363)
(commit 8d7dd4a)
affecting 'rpl.rpl_perfschema_threads_processlist_status' MTR test case
in favor of the changes made by upstream in WL#3549
"Binlog Compression"
(commit mysql/mysql-server@1e5ae34).

***
Reverted our 8.0.18 merge postfix
"PS-5674: gen_lex_token generator reworked"
(https://jira.percona.com/browse/PS-5674)
(commit 214212a)
in favor of the changes made by upstream in Bug #30765691
"FREE TOKEN SLOTS ARE EXHAUSTED IN GEN_LEX_TOKEN.CC"
(commit mysql/mysql-server@17ca03f).
The 'SYM_PERCONA()' macro was preserved and made a synonym for upstream's 'SYM()'.
Percona Server 5.7-specific tokens
- CHANGED_PAGE_BITMAPS_SYM
- CLIENT_STATS_SYM
- CLUSTERING_SYM
- COMPRESSION_DICTIONARY_SYM
- INDEX_STATS_SYM
- TABLE_STATS_SYM
- THREAD_STATS_SYM
- USER_STATS_SYM
- ENCRYPTION_KEY_ID_SYM
were explicitly assigned values starting from 1300. The same values were
assigned to them implicitly in Percona Server 8.0.19.
Percona Server 8.0-specific tokens
- EFFECTIVE_SYM
- SEQUENCE_TABLE_SYM
were explicitly assigned values starting from 1350. This group has different
values than in Percona Server 8.0.19.

***
Similarly to the other 'innodb.log_encrypt_<n>' MTR test cases, 'innodb.log_encrypt_7'
coming from upstream 8.0.20 was cloned into two test cases: 'innodb.log_encrypt_7_mk'
and 'innodb.log_encrypt_7_rk'.

***
Similarly to the other 'innodb.table_encrypt_<n>' MTR test cases, 'innodb.table_encrypt_6'
coming from upstream 8.0.20 was cloned into three test cases: 'innodb.table_encrypt_6',
'keyring_vault.table_encrypt_6' and 'keyring_vault.table_encrypt_6_directory'.

***
VERSION raised to "8.0.20-11".
univ.i version raised to "11".
https://jira.percona.com/browse/PS-6789

In the fix for Bug #30508721
"MTR DOESN'T KEEP TRACK OF THE STATE OF INNODB MONITORS"
(commit mysql/mysql-server@abd33c2)
Oracle extended the MTR 'check-testcase' procedure with an additional
comparison of data from the InnoDB metrics state. They also introduced the
'mysql-test/include/innodb_monitor_restore.inc' MTR include file that is
supposed to reset InnoDB monitors to their default state.

'mysql-test/include/innodb_monitor_restore.inc' was extended to also enable
the Percona-specific monitors that are on by default (those defined with the
'MONITOR_DEFAULT_ON' flag).

Similarly to what was done in the upstream patch, the
  "SET GLOBAL innodb_monitor_enable=default;"
  "SET GLOBAL innodb_monitor_disable=default;"
  "SET GLOBAL innodb_monitor_reset_all=default;"
statement sequences were replaced with
'--source include/innodb_monitor_restore.inc' throughout the test code.

As a result, the following MTR test cases were fixed:
- 'innodb.innodb_idle_flush_pct'
- 'innodb.lock_contention_big'
- 'innodb.monitor'
- 'innodb.percona_ahi_partitions'
- 'innodb.percona_changed_page_bmp_flush_5446'
- 'innodb.transportable_tbsp-debug'
- 'innodb_zip.transportable_tbsp_debug_zip'
- 'sys_vars.innodb_monitor_disable_basic'
- 'sys_vars.innodb_monitor_enable_basic'
- 'sys_vars.innodb_monitor_reset_all_basic'
- 'sys_vars.innodb_monitor_reset_basic'
- 'sys_vars.innodb_purge_run_now_basic'
- 'sys_vars.innodb_purge_stop_now_basic'
…ated MTR test cases

https://jira.percona.com/browse/PS-6789

The following MTR test cases were re-recorded because of the 'filesort' improvements
introduced in the fix for Oracle's Bug #30776132
"MAKE FILESORT KEYS CONSISTENT BETWEEN FIELDS AND ITEMS"
(commit mysql/mysql-server@6d587a6)
- 'main.pool_of_threads'
- 'main.pool_of_threads_high_prio_tickets'.

The following MTR test cases were re-recorded because of the changed execution
plans (more hash joins instead of nested block loops) introduced by these improvements:
Bug #30528604
"DELETE THE PRE-ITERATOR EXECUTOR"
(commit mysql/mysql-server@ef166f8),
Bug #30473261
"CONVERT THE INDEX SUBQUERY ENGINES INTO USING THE ITERATOR EXECUTOR"
(commit mysql/mysql-server@cb4116e)
(commit mysql/mysql-server@629b549)
(commit mysql/mysql-server@5a41fba)
(commit mysql/mysql-server@31bd903)
(commit mysql/mysql-server@75bbe1b)
(commit mysql/mysql-server@6226c1a)
(commit mysql/mysql-server@0b45e96)
(commit mysql/mysql-server@8e45d7e)
(commit mysql/mysql-server@7493ae4)
(commit mysql/mysql-server@a5f60bf)
(commit mysql/mysql-server@609b86e),
Bug #30912972
"ASSERTION `KEYLEN == M_START_KEY.LENGTH' FAILED"
(commit mysql/mysql-server@b28bea5)
- 'audit_log.audit_log_filter_db'
- 'main.pool_of_threads'
- 'main.pool_of_threads_high_prio_tickets'
- 'main.percona_expand_fast_index_creation'
- 'main.percona_sequence_table'
https://jira.percona.com/browse/PS-6789

Re-recorded 'main.bug74778' MTR test case because of the new 'SHOW_ROUTINE'
privilege implemented by Oracle in WL #9049
"Add a dynamic privilege for stored routine backup"
(https://dev.mysql.com/worklog/task/?id=9049)
(commit mysql/mysql-server@3e41e44)
… MTR test case

https://jira.percona.com/browse/PS-6789

Re-recorded 'main.backup_locks_mysqldump' MTR test case because of the new default
'mysqldump' network timeout introduced in the fix for Oracle Bug #30755992 / #98203
"mysql dump sufficiently long network timeout too short"
(https://bugs.mysql.com/bug.php?id=98203)
(commit mysql/mysql-server@1f90fad)
https://jira.percona.com/browse/PS-6789

Re-recorded 'main.bug88797' MTR test case because of the new deprecation
warning introduced in the implementation of WL #13325
"Deprecate VALUES syntax in INSERT ... ON DUPLICATE KEY UPDATE"
(https://dev.mysql.com/worklog/task/?id=13325)
(commit mysql/mysql-server@6f3b9df)
…test cases with explicit binlog positions

https://jira.percona.com/browse/PS-6789

Fixed/re-recorded the following MTR test cases because of the changes in
the implementation of WL #3549
"Binlog: compression"
(https://dev.mysql.com/worklog/task/?id=3549)
(commit mysql/mysql-server@1e5ae34)
that increased the 'Format_description_event' binlog event size and
therefore shifted some pre-recorded binary log positions in the '.result' files:
- 'main.backup_safe_binlog_info'
- 'main.mysqldump-max'
- 'binlog.percona_binlog_consistent_mixed'
- 'binlog.percona_binlog_consistent_row'
- 'binlog.percona_binlog_consistent_stmt'
- 'binlog.percona_binlog_consistent_debug'
…space encryption)

https://jira.percona.com/browse/PS-6789

1. Re-enabled system tablespace encryption after the 8.0.20 upstream merge
   that brought the new parallel doublewrite implementation
   (https://jira.percona.com/browse/PS-3822).
2. Removed the 'innodb.percona_sys_tablespace_encrypt_dblwr' MTR test case as
   there is no doublewrite buffer in the system tablespace anymore.
…29 (Innodb key rotation. ALPHA)

https://jira.percona.com/browse/PS-6789

Restored 'buf0dblwr.cc' part of the PS-3829 "Innodb key rotation. ALPHA"
https://jira.percona.com/browse/PS-3829
(commit c7f44ee)
after the upstream 8.0.20 merge.

The following MTR test cases no longer crash:
- 'encryption.upgrade_crypt_data_57_v1'
- 'encryption.upgrade_crypt_data_v1'
- 'innodb.innodb_scrub'
- 'main.percona_dd_upgrade_encrypted'
…in.percona_signal_handling_threadpool MTR test cases

https://jira.percona.com/browse/PS-6789

Fixed and re-recorded the 'main.percona_signal_handling' and
'main.percona_signal_handling_threadpool' MTR test cases in response to the
changes in Bug #30578923
"SENDING SIGHUP CAUSES A LOT OF GARBAGE TO BE PRINTED"
(commit mysql/mysql-server@b90a1b3).
The removed "Status information:" log section is now simulated via
'DBUG_EXECUTE_IF()', and the MTR test cases were made debug-only.
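
A minimal sketch of that debug-injection approach, assuming a hypothetical probe name and helper (the actual names used in the test cases may differ):

```
#include "my_dbug.h"  // DBUG_EXECUTE_IF; compiled in only for debug builds

// Hypothetical stand-in for the removed "Status information:" dump.
static void print_status_information() {}

void handle_sighup_for_test() {
  // A debug build lets an MTR test activate the probe, e.g. with
  // SET debug = '+d,simulate_status_information'; release builds
  // compile the block away, hence the debug-only test cases.
  DBUG_EXECUTE_IF("simulate_status_information",
                  { print_status_information(); });
}
```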
Review comments (now outdated/resolved) were left on:
- storage/innobase/include/os0enc.h
- storage/innobase/buf/buf0dblwr.cc (3 threads)
- storage/innobase/include/os0file.h
- storage/innobase/os/os0enc.cc (4 threads)
satya-bodapati changed the title from "Doublewrite buffer encryption" to "Re-implement Doublewrite buffer encryption" on Jun 11, 2020.
satya-bodapati (Collaborator, Author)

./mtr --mem innodb.percona_parallel_dblwr_encrypt{,,,} --parallel=4 --repeat=20
Logging: /home/satya/WORK/ps-8.0.20-merge/mysql-test/mysql-test-run.pl --mem innodb.percona_parallel_dblwr_encrypt innodb.percona_parallel_dblwr_encrypt innodb.percona_parallel_dblwr_encrypt innodb.percona_parallel_dblwr_encrypt --parallel=4 --repeat=20
MySQL Version 8.0.20

[ 93%] innodb.percona_parallel_dblwr_encrypt w4 [ pass ] 11147
[ 95%] innodb.percona_parallel_dblwr_encrypt w2 [ pass ] 11241
[ 96%] innodb.percona_parallel_dblwr_encrypt w3 [ pass ] 10978
[ 97%] innodb.percona_parallel_dblwr_encrypt w1 [ pass ] 10974
[ 98%] innodb.percona_parallel_dblwr_encrypt w4 [ pass ] 10765
[100%] innodb.percona_parallel_dblwr_encrypt w2 [ pass ] 10931

The servers were restarted 76 times
The servers were reinitialized 0 times
Spent 1837.889 of 794 seconds executing testcases

Completed: All 80 tests were successful.

percona-ysorokin pushed a commit that referenced this pull request Sep 17, 2020
…o: object '/lib64/libtirpc.so' from LD_PRELOAD cannot be preloaded

Problem
=======
Running mtr with an ASAN build on Gentoo fails since the path to
libtirpc is not /lib64/libtirpc.so, which is the path mtr uses for
preloading the library.

Furthermore, the libasan path on Gentoo may also contain underscores and
minus signs, which mtr's safe_process does not recognize.

Fails on Gentoo since /lib64/libtirpc.so does not exist:
+ERROR: ld.so: object '/lib64/libtirpc.so' from LD_PRELOAD cannot be preloaded (cannot open shared object file): ignored.

Fails on Gentoo since /usr/lib64/libtirpc.so is a GNU LD script:
+ERROR: ld.so: object '/usr/lib64/libtirpc.so' from LD_PRELOAD cannot be preloaded (invalid ELF header): ignored.

/lib64/libtirpc.so.3 needs to be preloaded on Gentoo.

When compiling with GNU C++, the libasan path also includes minus signs and underscores:

$ less mysql-test/lib/My/SafeProcess/ldd_asan_test_result
        linux-vdso.so.1 (0x00007ffeba962000)
        libasan.so.4 => /usr/lib/gcc/x86_64-pc-linux-gnu/7.3.0/libasan.so.4 (0x00007f3c2e827000)

Tests that have been affected in different ways include, for example:

$ ./mtr group_replication.gr_clone_integration_clone_not_installed
[100%] group_replication.gr_clone_integration_clone_not_installed w3  [ fail ]
...
ERROR: ld.so: object '/usr/lib/gcc/x86' from LD_PRELOAD cannot be preloaded
(cannot open shared object file): ignored.
ERROR: ld.so: object '/lib64/libtirpc.so' from LD_PRELOAD cannot be preloaded
(cannot open shared object file): ignored.
mysqltest: At line 21: Query 'START GROUP_REPLICATION' failed.
ERROR 2013 (HY000): Lost connection to MySQL server during query
...
ASAN:DEADLYSIGNAL
=================================================================
==11970==ERROR: AddressSanitizer: SEGV on unknown address 0x000000000000 (pc
0x7f0e5cecfb8c bp 0x7f0e340f1650 sp 0x7f0e340f0dc8 T44)
==11970==The signal is caused by a READ memory access.
==11970==Hint: address points to the zero page.
    #0 0x7f0e5cecfb8b in xdr_uint32_t (/lib64/libc.so.6+0x13cb8b)
    #1 0x7f0e5fbe6d43
(/usr/lib/gcc/x86_64-pc-linux-gnu/7.3.0/libasan.so.4+0x87d43)
    #2 0x7f0e3c675e59 in xdr_node_no
plugin/group_replication/libmysqlgcs/xdr_gen/xcom_vp_xdr.c:88
    #3 0x7f0e3c67744d in xdr_pax_msg_1_6
plugin/group_replication/libmysqlgcs/xdr_gen/xcom_vp_xdr.c:852
...

$ ./mtr ndb.ndb_config
[100%] ndb.ndb_config                             [ fail ]
...
 --- /.../src/mysql-test/suite/ndb/r/ndb_config.result 2019-06-25
21:19:08.308997942 +0300
 +++ /.../bld/mysql-test/var/log/ndb_config.reject     2019-06-26
11:58:11.718512944 +0300
@@ -30,16 +30,22 @@
 == 16 == bug44689
 192.168.0.1 192.168.0.2 192.168.0.3 192.168.0.4 192.168.0.1 192.168.0.1
 == 17 == bug49400
+ERROR: ld.so: object '/usr/lib/gcc/x86' from LD_PRELOAD cannot be preloaded
(cannot open shared object file): ignored.
+ERROR: ld.so: object '/lib64/libtirpc.so' from LD_PRELOAD cannot be
preloaded (cannot open shared object file): ignored.
  ERROR    -- at line 25: TCP connection is a duplicate of the existing TCP
link from line 14
  ERROR    -- at line 25: Could not store section of configuration file.

$ ./mtr ndb.ndb_basic
[100%] ndb.ndb_basic                             [ pass ]  34706
ERROR: ld.so: object '/usr/lib/gcc/x86' from LD_PRELOAD cannot be preloaded
(cannot open shared object file): ignored.
ERROR: ld.so: object '/lib64/libtirpc.so' from LD_PRELOAD cannot be preloaded
(cannot open shared object file): ignored.

Solution
========
In safe_process, use the same trick for libtirpc as for libasan to
determine the path to the library for preloading.

Also allow underscores and minus signs in paths.

In addition, add some memory-leak suppressions for Perl.
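
A hedged illustration (not the actual safe_process/mtr code) of the kind of path matching the fix needs: resolved library paths containing underscores and minus signs, such as /usr/lib/gcc/x86_64-pc-linux-gnu/7.3.0/libasan.so.4, must be accepted when scanning ldd output.

```
#include <regex>
#include <string>

// Extract the resolved path for a library from one line of ldd output,
// e.g. "libtirpc.so.3 => /lib64/libtirpc.so.3 (0x...)".  The character
// class deliberately accepts '_' (via \w) and '-'.
std::string find_preload_path(const std::string &ldd_line,
                              const std::string &lib_stem) {
  const std::regex re(lib_stem + R"(\S*\s+=>\s+([\w/.\-]+))");
  std::smatch m;
  if (std::regex_search(ldd_line, m, re)) return m[1];
  return {};
}
```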

Change-Id: Ia02e354a20cf8b279eb2573f3f8c2c39776343dc
(cherry picked from commit e88706d)
percona-ysorokin pushed a commit that referenced this pull request Feb 18, 2021
To call a service implementation one needs to:
1. query the registry to get a reference to the service needed
2. call the service via the reference
3. call the registry to release the reference

While step #2 is very fast (just a function pointer call), steps #1 and #3 can
be expensive since they need to interact with the registry's global
structure in a read/write fashion.

Hence, if the above sequence is to be repeated in quick succession, it is
beneficial to do steps #1 and #3 just once and aggregate as many #2
steps as possible into a single sequence.

This will usually mean caching the service reference received in step #1 and
delaying step #3 for as long as possible.

But since an active reference is held to the service implementation until
step #3 is done, special handling is needed to make sure that:

- The references are released at regular intervals so that changes in the
  registry can become effective.
- There is a way to mark a service implementation as "inactive" ("dying")
  so that no new references are possible until all of the active references
  to it are released.

All of the above is part of the current audit API machinery, but needs to be
isolated into a separate service suite and made generally available to
all services.

This is what this worklog aims to implement.

RB#24806
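
A minimal sketch of the acquire/call/release pattern described above, assuming the component-services RAII wrapper my_service<> from mysql/components/my_service.h; the concrete service name and the loop body are illustrative only.

```
#include <mysql/components/my_service.h>
#include <mysql/components/services/registry.h>

void run_batch(SERVICE_TYPE(registry) *registry) {
  // Step #1, done once: acquire and cache the service reference.
  my_service<SERVICE_TYPE(log_builtins_string)> svc("log_builtins_string",
                                                    registry);
  if (!svc.is_valid()) return;

  // Step #2, repeated: calling through the cached reference is just a
  // function-pointer call, so aggregating many calls here is cheap.
  for (int i = 0; i < 1000; ++i) {
    // ... svc->some_method(...); illustrative only ...
  }

  // Step #3: my_service's destructor releases the reference on scope
  // exit, letting a "dying" implementation finally be unloaded.
}
```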
percona-ysorokin pushed a commit that referenced this pull request Feb 18, 2021
TABLESPACE STATE DOES NOT CHANGE THE SPACE TO EMPTY

After the commit for Bug#31991688, it was found that an idle system may
not ever get around to truncating an undo tablespace when it is SET INACTIVE.
Actually, it takes about 128 seconds before the undo tablespace is finally
truncated.

There are three main tasks for the function trx_purge().
1) Process the undo logs and apply changes to the data files.
   (May be multiple threads)
2) Clean up the history list by freeing old undo logs and rollback
   segments.
3) Truncate undo tablespaces that have grown too big or are SET INACTIVE
   explicitly.

Bug#31991688 made sure that steps 2 & 3 are not done too often.
Concentrating this effort keeps the purge lag from growing too large.
By default, trx_purge() does step #1 128 times before attempting steps
#2 & #3, which are called the 'truncate' steps.  This is controlled by the
setting innodb_purge_rseg_truncate_frequency.

On an idle system, trx_purge() is called once per second if it has nothing
to do in step 1.  After 128 seconds, it will finally do steps 2 (truncating
the undo logs and rollback segments which reduces the history list to zero)
and step 3 (truncating any undo tablespaces that need it).

The function that the purge coordinator thread uses to make these repeated
calls to trx_purge() is called srv_do_purge(). When trx_purge() returns
having done nothing, srv_do_purge() returns to srv_purge_coordinator_thread()
which will put the purge thread to sleep.  It is woken up again once per
second by the master thread in srv_master_do_idle_tasks(), if not sooner
by any of several other threads and activities.

This is how an idle system can wait 128 seconds before the truncate steps
are done and an undo tablespace that was SET INACTIVE can finally become
'empty'.

The solution in this patch is to modify srv_do_purge() so that if trx_purge()
did nothing and there is an undo space that was explicitly set to inactive,
it will immediately call trx_purge() again with do_truncate=true so that
steps #2 and #3 are done.

This does not affect the effort by Bug#31991688 to keep the purge lag from
growing too big on sysbench UPDATE NO_KEY. With this change, the purge lag
has to be zero and there must be a pending explicit undo space truncate
before this extra call to trx_purge is done.
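
A hedged sketch of the shape of that srv_do_purge() change; purge_once() stands in for trx_purge(), and undo_spaces_pending_explicit_truncate() is a hypothetical stand-in for the actual pending-truncate check.

```
using ulint = unsigned long;

bool undo_spaces_pending_explicit_truncate();  // hypothetical check

ulint purge_once(ulint n_threads, ulint batch_size, bool do_truncate);
// stand-in for trx_purge(n_threads, batch_size, do_truncate)

void do_purge_step(ulint n_threads, ulint batch_size, bool do_truncate) {
  ulint n_pages = purge_once(n_threads, batch_size, do_truncate);
  if (n_pages == 0 && !do_truncate &&
      undo_spaces_pending_explicit_truncate()) {
    // Idle system with a pending SET INACTIVE: run the truncate steps
    // (#2 and #3) now instead of waiting up to ~128 one-second cycles.
    purge_once(n_threads, batch_size, /*do_truncate=*/true);
  }
}
```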

Approved by Sunny in RB#25311
percona-ysorokin pushed a commit that referenced this pull request Feb 18, 2021
…TH VS 2019 [#3] [noclose]

storage\ndb\src\common\portlib\NdbThread.cpp(1240,3): warning C4805: '==': unsafe mix of type 'int' and type 'bool' in operation

Change-Id: I33e3ff9845f3d3e496f64401d30eaa9b992da594
percona-ysorokin pushed a commit that referenced this pull request Jul 8, 2021
Upstream commit ID : fb-mysql-5.6.35/e025cf1c47e63aada985d78e4083f2e02fba434f
PS-7731 : Merge percona-202102

Summary:
Today in `SELECT count(*)` MyRocks would still decode every single column due to this check, despite the readset being empty:

```
 // bitmap is cleared on index merge, but it still needs to decode columns
    bool field_requested =
        decode_all_fields || m_verify_row_debug_checksums ||
        bitmap_is_set(field_map, m_table->field[i]->field_index);
```
As a result MyRocks is significantly slower than InnoDB in this particular scenario.

Turns out in index merge, when it tries to reset, it calls ha_index_init with an empty column_bitmap, so our field decoders didn't know it needs to decode anything, so the entire query would return nothing. This is discussed in [this commit](facebook/mysql-5.6@70f2bcd), and [issue 624](facebook/mysql-5.6#624) and [PR 626](facebook/mysql-5.6#626). So the workaround we had at that time is to simply treat empty map as implicitly everything, and the side effect is massively slowed down count(*).

We have a few options to address this:
1. Fix index merge optimizer - looking at the code in QUICK_RANGE_SELECT::init_ror_merged_scan, it actually fixes up the column_bitmap properly, but after init/reset, so the fix would simply be moving the bitmap set code up. For secondary keys, prepare_for_position will automatically call `mark_columns_used_by_index_no_reset(s->primary_key, read_set)` if HA_PRIMARY_KEY_REQUIRED_FOR_POSITION is set (true for both InnoDB and MyRocks), so we would know correctly that we need to unpack PK when walking SK during index merge.
2. Overriding `column_bitmaps_signal` and setup decoders whenever the bitmap changes - however this doesn't work by itself. Because no storage engine today actually use handler::column_bitmaps_signal this path haven't been tested properly in index merge. In this case, QUICK_RANGE_SELECT::init_ror_merged_scan should call set_column_bitmaps_no_signal to avoid resetting the correct read/write set of head since head is used as first handler (reuses_handler=true) and subsequent place holders for read/write set updates (reuse_handler=false).
3. Follow InnoDB's solution - InnoDB delays it actually initialize its template again in index_read for the 2nd time (relying on `prebuilt->sql_stat_start`), and during index_read `QUICK_RANGE_SELECT::column_bitmap` is already fixed up and the table read/write set is switched to it, so the new template would be built correctly.

In order to make it easier to maintain and port, after discussing with Manuel, I'm going with a simplified version of #3 that delays decoder creation until the first read operation (index_*, rnd_*, range_read_*, multi_range_read_*) and sets the delay flag in index_init / rnd_init / multi_range_read_init.
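
A self-contained sketch of the "delay decoder setup until the first read" idea from option #3; the class and member names are illustrative, not MyRocks' actual API.

```
#include <bitset>

class RowReader {
  std::bitset<64> m_decode_map;       // columns the decoders will unpack
  bool m_need_build_decoder = false;  // the "delay flag" set in *_init

 public:
  void index_init(const std::bitset<64> & /*bitmap, may be empty*/) {
    // Index merge can call init with an empty column bitmap and only fix
    // the read set up afterwards, so don't build decoders here.
    m_need_build_decoder = true;
  }

  void index_read(const std::bitset<64> &read_set) {
    if (m_need_build_decoder) {
      // By the first read operation the caller's read set is final,
      // so the decoders match the columns actually requested.
      m_decode_map = read_set;
      m_need_build_decoder = false;
    }
    // ... decode only the columns set in m_decode_map ...
  }
};
```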

Also, I ran into a bug with truncation_partition where Rdb_converter's tbl_def is stale (we only update ha_rocksdb::m_tbl_def), but it is fine because it is not being used after table open. But my change moves the lookup_bitmap initialization into Rdb_converter which takes a dependency on Rdb_converter::m_tbl_def so now we need to reset it properly.

Reference Patch: facebook/mysql-5.6@44d6a8d

---------
Porting Note: Due to 8.0's new counting infra (handler::record & handler::record_with_index), this only helps PK counting. Will send out a better fix that works better with 8.0 new counting infra.

Reviewed By: Pushapgl

Differential Revision: D26265470

fbshipit-source-id: f142be681ab
percona-ysorokin pushed a commit that referenced this pull request Mar 11, 2022
…close]

Make the range optimizer return AccessPaths instead of TABLE_READ_PLAN.
This is the first step of getting rid of TABLE_READ_PLAN and moving
everything into AccessPath; currently, it's just a very thin shell:

 1. TRPs are still used internally, and AccessPath is created
    at the very end.
 2. Child TRPs are still child TRPs (i.e., there are no child
    AccessPaths).
 3. All returned AccessPaths are still of the type INDEX_RANGE_SCAN,
    wrapping a TRP.
 4. Some callers still reach directly into the TRP, assuming #3.

Most callers (save for the aforementioned #4) use a set of simple wrapper
functions to access TRP-derived properties from AccessPaths; as we
continue the transformation, this is the main place we'll change the
interaction (i.e., most of the calling code will remain unchanged).
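
An illustrative sketch of the "thin shell" relationship and a wrapper function described above; names are simplified stand-ins, not the server's actual definitions.

```
struct TABLE_READ_PLAN;  // legacy range plan, still built internally

struct AccessPath {
  enum Type { INDEX_RANGE_SCAN } type;  // point 3 above: only this type
  TABLE_READ_PLAN *trp;                 // the wrapped TRP (point 1)
};

// One of the "simple wrapper functions": callers ask the AccessPath, and
// only this layer reaches into the TRP, so later steps of the
// transformation can change the internals without touching callers.
unsigned trp_index(const AccessPath &path);  // would consult path.trp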

Change-Id: I3d9dc9e33c53d1e5124ea9c47b7d6d9270cd1906
percona-ysorokin pushed a commit that referenced this pull request Mar 11, 2022
This error happens for queries such as:

SELECT ( SELECT 1 FROM t1 ) AS a,
  ( SELECT a FROM ( SELECT x FROM t1 ORDER BY a ) AS d1 );

Query_block::prepare() for query block #4 (corresponding to the 4th
SELECT in the query above) calls setup_order() which again calls
find_order_in_list(). That function replaces an Item_ident for 'a' in
Query_block.order_list with an Item_ref pointing to query block #2.
Then Query_block::merge_derived() merges query block #4 into query
block #3. The Item_ref mentioned above is then moved to the order_list
of query block #3.

In the next step, find_order_in_list() is called for query block #3.
At this point, 'a' in the select list has been resolved to another
Item_ref, also pointing to query block #2. find_order_in_list()
detects that the Item_ref in the order_list is equivalent to the
Item_ref in the select list, and therefore decides to replace the
former with the latter. Then find_order_in_list() calls
Item::clean_up_after_removal() recursively (via Item::walk()) for the
order_list Item_ref (since that is no longer needed).

When calling clean_up_after_removal(), no
Cleanup_after_removal_context object is passed. This is the actual
error, as there should be a context pointing to query block #3 that
ensures that clean_up_after_removal() only purges Item_subselect.unit
if both of the following conditions hold:

1) The Item_subselect should not be in any of the Item trees in the
   select list of query block #3.

2) Item_subselect.unit should be a descendant of query block #3.

These conditions ensure that we only purge Item_subselect.unit if we
are sure that it is not needed elsewhere. But without the right
context, query block #2 gets purged even if it is used in the select
lists of query blocks #1 and #3.

The fix is to pass a context (for query block #3) to clean_up_after_removal().
Both of the above conditions then become false, and Item_subselect.unit is
not purged. As an additional shortcut, find_order_in_list() will not call
clean_up_after_removal() if real_item() of the order item and the select
list item are identical.
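
A hedged sketch of the two purge conditions; the helper functions are hypothetical stand-ins for the actual Item/Query_block traversals.

```
class Item_subselect;
class Query_block;

// Hypothetical helpers, assumed for illustration only:
bool used_in_select_list(const Item_subselect *, const Query_block *);
bool unit_is_descendant_of(const Item_subselect *, const Query_block *);

// Purge Item_subselect.unit only when both conditions from the text
// hold: (1) it is not referenced from the context block's select list,
// and (2) its unit hangs below the context block.
bool may_purge(const Item_subselect *item, const Query_block *ctx) {
  return !used_in_select_list(item, ctx) &&
         unit_is_descendant_of(item, ctx);
}
```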

In addition, this commit changes clean_up_after_removal() so that it
requires the context to be non-null, to prevent similar errors. It
also simplifies Item_sum::clean_up_after_removal() by removing window
functions unconditionally (and adds a corresponding test case).

Change-Id: I449be15d369dba97b23900d1a9742e9f6bad4355
percona-ysorokin pushed a commit that referenced this pull request May 18, 2022
*Problem:*

ASAN complains about stack-buffer-overflow on function `mysql_heartbeat`:

```
==90890==ERROR: AddressSanitizer: stack-buffer-overflow on address 0x7fe746d06d14 at pc 0x7fe760f5b017 bp 0x7fe746d06cd0 sp 0x7fe746d06478
WRITE of size 24 at 0x7fe746d06d14 thread T16777215

Address 0x7fe746d06d14 is located in stack of thread T26 at offset 340 in frame
    #0 0x7fe746d0a55c in mysql_heartbeat(void*) /home/yura/ws/percona-server/plugin/daemon_example/daemon_example.cc:62

  This frame has 4 object(s):
    [48, 56) 'result' (line 66)
    [80, 112) '_db_stack_frame_' (line 63)
    [144, 200) 'tm_tmp' (line 67)
    [240, 340) 'buffer' (line 65) <== Memory access at offset 340 overflows this variable
HINT: this may be a false positive if your program uses some custom stack unwind mechanism, swapcontext or vfork
      (longjmp and C++ exceptions *are* supported)
Thread T26 created by T25 here:
    #0 0x7fe760f5f6d5 in __interceptor_pthread_create ../../../../src/libsanitizer/asan/asan_interceptors.cpp:216
    #1 0x557ccbbcb857 in my_thread_create /home/yura/ws/percona-server/mysys/my_thread.c:104
    #2 0x7fe746d0b21a in daemon_example_plugin_init /home/yura/ws/percona-server/plugin/daemon_example/daemon_example.cc:148
    #3 0x557ccb4c69c7 in plugin_initialize /home/yura/ws/percona-server/sql/sql_plugin.cc:1279
    #4 0x557ccb4d19cd in mysql_install_plugin /home/yura/ws/percona-server/sql/sql_plugin.cc:2279
    #5 0x557ccb4d218f in Sql_cmd_install_plugin::execute(THD*) /home/yura/ws/percona-server/sql/sql_plugin.cc:4664
    #6 0x557ccb47695e in mysql_execute_command(THD*, bool) /home/yura/ws/percona-server/sql/sql_parse.cc:5160
    #7 0x557ccb47977c in mysql_parse(THD*, Parser_state*, bool) /home/yura/ws/percona-server/sql/sql_parse.cc:5952
    #8 0x557ccb47b6c2 in dispatch_command(THD*, COM_DATA const*, enum_server_command) /home/yura/ws/percona-server/sql/sql_parse.cc:1544
    #9 0x557ccb47de1d in do_command(THD*) /home/yura/ws/percona-server/sql/sql_parse.cc:1065
    #10 0x557ccb6ac294 in handle_connection /home/yura/ws/percona-server/sql/conn_handler/connection_handler_per_thread.cc:325
    #11 0x557ccbbfabb0 in pfs_spawn_thread /home/yura/ws/percona-server/storage/perfschema/pfs.cc:2198
    #12 0x7fe760ab544f in start_thread nptl/pthread_create.c:473
```

The reason is that `my_thread_cancel` is used to finish the daemon thread. This is not an orderly way of finishing the thread. ASAN does not register that the stack variables are no longer used, which generates the error above.

This is a benign error as all the variables are on the stack.

*Solution*:

Finish the thread in an orderly way by using a signalling variable.
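
A minimal sketch of the signalling-variable shutdown, assuming a plain std::thread in place of the plugin's my_thread machinery.

```
#include <atomic>
#include <chrono>
#include <thread>

std::atomic<bool> heartbeat_stop{false};

void heartbeat_loop() {
  while (!heartbeat_stop.load(std::memory_order_relaxed)) {
    // ... write one heartbeat record ...
    std::this_thread::sleep_for(std::chrono::seconds(1));
  }
  // Returning normally unwinds the stack, so ASAN sees the local
  // variables go out of scope, unlike my_thread_cancel().
}

void plugin_deinit(std::thread &heartbeat_thread) {
  heartbeat_stop.store(true);   // signal instead of cancelling
  heartbeat_thread.join();      // orderly finish
}
```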
percona-ysorokin pushed a commit that referenced this pull request Jul 5, 2022
…ILER WARNINGS

Remove stringop-truncation warning in ndb_config.cpp by refactoring.

Change-Id: I1eea7fe190926a85502e73ca7ebf07d984af9a09
percona-ysorokin pushed a commit that referenced this pull request Aug 2, 2022
**Problem:**

The tests fail under ASAN:

```
==470513==ERROR: AddressSanitizer: heap-use-after-free on address 0x632000054e20 at pc 0x556599b68016 bp 0x7ffc630afb30 sp 0x7ffc630afb20
READ of size 8 at 0x632000054e20 thread T0
    #0 0x556599b68015 in destroy_rwlock(PFS_rwlock*) /tmp/ps/storage/perfschema/pfs_instr.cc:430
    #1 0x556599b30b82 in pfs_destroy_rwlock_v2(PSI_rwlock*) /tmp/ps/storage/perfschema/pfs.cc:2596
    #2 0x7fa44336d62e in inline_mysql_rwlock_destroy /tmp/ps/include/mysql/psi/mysql_rwlock.h:289
    #3 0x7fa44336da39 in vtoken_lock_cleanup::~vtoken_lock_cleanup() /tmp/ps/plugin/version_token/version_token.cc:517
    #4 0x7fa46a7188a6 in __run_exit_handlers /build/glibc-SzIz7B/glibc-2.31/stdlib/exit.c:108
    #5 0x7fa46a718a5f in __GI_exit /build/glibc-SzIz7B/glibc-2.31/stdlib/exit.c:139
    #6 0x556596531da2 in mysqld_exit /tmp/ps/sql/mysqld.cc:2512
    #7 0x55659655d579 in mysqld_main(int, char**) /tmp/ps/sql/mysqld.cc:8505
    #8 0x55659609c5b5 in main /tmp/ps/sql/main.cc:25
    #9 0x7fa46a6f6082 in __libc_start_main ../csu/libc-start.c:308
    #10 0x55659609c4ed in _start (/tmp/results/PS/runtime_output_directory/mysqld+0x3c1b4ed)

0x632000054e20 is located 50720 bytes inside of 90112-byte region [0x632000048800,0x63200005e800)
freed by thread T0 here:
    #0 0x7fa46b5f940f in __interceptor_free ../../../../src/libsanitizer/asan/asan_malloc_linux.cc:122
    #1 0x556599b617eb in pfs_free(PFS_builtin_memory_class*, unsigned long, void*) /tmp/ps/storage/perfschema/pfs_global.cc:113
    #2 0x556599b61a15 in pfs_free_array(PFS_builtin_memory_class*, unsigned long, unsigned long, void*) /tmp/ps/storage/perfschema/pfs_global.cc:177
    #3 0x556599b6f28b in PFS_buffer_default_allocator<PFS_rwlock>::free_array(PFS_buffer_default_array<PFS_rwlock>*) /tmp/ps/storage/perfschema/pfs_buffer_container.h:172
    #4 0x556599b75628 in PFS_buffer_scalable_container<PFS_rwlock, 1024, 1024, PFS_buffer_default_array<PFS_rwlock>, PFS_buffer_default_allocator<PFS_rwlock> >::cleanup() /tmp/ps/storage/perfschema/pfs_buffer_container.h:452
    #5 0x556599b6d591 in cleanup_instruments() /tmp/ps/storage/perfschema/pfs_instr.cc:231
    #6 0x556599b8c3f1 in cleanup_performance_schema /tmp/ps/storage/perfschema/pfs_server.cc:343
    #7 0x556599b8dcfc in shutdown_performance_schema() /tmp/ps/storage/perfschema/pfs_server.cc:374
    #8 0x556596531d96 in mysqld_exit /tmp/ps/sql/mysqld.cc:2500
    #9 0x55659655d579 in mysqld_main(int, char**) /tmp/ps/sql/mysqld.cc:8505
    #10 0x55659609c5b5 in main /tmp/ps/sql/main.cc:25
    #11 0x7fa46a6f6082 in __libc_start_main ../csu/libc-start.c:308

previously allocated by thread T0 here:
    #0 0x7fa46b5fa6e5 in __interceptor_posix_memalign ../../../../src/libsanitizer/asan/asan_malloc_linux.cc:217
    #1 0x556599b6167e in pfs_malloc(PFS_builtin_memory_class*, unsigned long, int) /tmp/ps/storage/perfschema/pfs_global.cc:68
    #2 0x556599b6187a in pfs_malloc_array(PFS_builtin_memory_class*, unsigned long, unsigned long, int) /tmp/ps/storage/perfschema/pfs_global.cc:155
    #3 0x556599b6fa9e in PFS_buffer_default_allocator<PFS_rwlock>::alloc_array(PFS_buffer_default_array<PFS_rwlock>*) /tmp/ps/storage/perfschema/pfs_buffer_container.h:159
    #4 0x556599b6ff12 in PFS_buffer_scalable_container<PFS_rwlock, 1024, 1024, PFS_buffer_default_array<PFS_rwlock>, PFS_buffer_default_allocator<PFS_rwlock> >::allocate(pfs_dirty_state*) /tmp/ps/storage/perfschema/pfs_buffer_container.h:602
    #5 0x556599b69abc in create_rwlock(PFS_rwlock_class*, void const*) /tmp/ps/storage/perfschema/pfs_instr.cc:402
    #6 0x556599b341f5 in pfs_init_rwlock_v2(unsigned int, void const*) /tmp/ps/storage/perfschema/pfs.cc:2578
    #7 0x556599b9487b in inline_mysql_rwlock_init /tmp/ps/include/mysql/psi/mysql_rwlock.h:261
    #8 0x556599b94ba7 in init_pfs_tls_channels_instrumentation() /tmp/ps/storage/perfschema/pfs_tls_channel.cc:209
    #9 0x556599b8ca44 in initialize_performance_schema(PFS_global_param*, PSI_thread_bootstrap**, PSI_mutex_bootstrap**, PSI_rwlock_bootstrap**, PSI_cond_bootstrap**, PSI_file_bootstrap**, PSI_socket_bootstrap**, PSI_table_bootstrap**, PSI_mdl_bootstrap**, PSI_idle_bootstrap**, PSI_stage_bootstrap**, PSI_statement_bootstrap**, PSI_transaction_bootstrap**, PSI_memory_bootstrap**, PSI_error_bootstrap**, PSI_data_lock_bootstrap**, PSI_system_bootstrap**, PSI_tls_channel_bootstrap**) /tmp/ps/storage/perfschema/pfs_server.cc:266
    #10 0x55659655a585 in mysqld_main(int, char**) /tmp/ps/sql/mysqld.cc:7497
    #11 0x55659609c5b5 in main /tmp/ps/sql/main.cc:25
    #12 0x7fa46a6f6082 in __libc_start_main ../csu/libc-start.c:308

SUMMARY: AddressSanitizer: heap-use-after-free /tmp/ps/storage/perfschema/pfs_instr.cc:430 in destroy_rwlock(PFS_rwlock*)
Shadow bytes around the buggy address:
  0x0c6480002970: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
  0x0c6480002980: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
  0x0c6480002990: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
  0x0c64800029a0: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
  0x0c64800029b0: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
=>0x0c64800029c0: fd fd fd fd[fd]fd fd fd fd fd fd fd fd fd fd fd
  0x0c64800029d0: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
  0x0c64800029e0: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
  0x0c64800029f0: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
  0x0c6480002a00: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
  0x0c6480002a10: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
Shadow byte legend (one shadow byte represents 8 application bytes):
  Addressable:           00
  Partially addressable: 01 02 03 04 05 06 07
  Heap left redzone:       fa
  Freed heap region:       fd
  Stack left redzone:      f1
  Stack mid redzone:       f2
  Stack right redzone:     f3
  Stack after return:      f5
  Stack use after scope:   f8
  Global redzone:          f9
  Global init order:       f6
  Poisoned by user:        f7
  Container overflow:      fc
  Array cookie:            ac
  Intra object redzone:    bb
  ASan internal:           fe
  Left alloca redzone:     ca
  Right alloca redzone:    cb
  Shadow gap:              cc
==470513==ABORTING
```

The reason for the error is Percona's change in commit
5ae4d27, which causes the plugin's static
variables not to be deallocated.

This causes `void cleanup_instruments()` to be called before
`vtoken_lock_cleanup::~vtoken_lock_cleanup()`, which finds
the memory of the object to have been deallocated.

**Solution:**

Do not run the tests under ASAN or Valgrind.
percona-ysorokin pushed a commit that referenced this pull request Aug 2, 2022
**Problem:**

The following leak is detected when running the test
`encryption.upgrade_crypt_data_57_v1`:

```
==388399==ERROR: LeakSanitizer: detected memory leaks

Direct leak of 70 byte(s) in 1 object(s) allocated from:
    #0 0x7f5f87812808 in __interceptor_malloc ../../../../src/libsanitizer/asan/asan_malloc_linux.cc:144
    #1 0x55f098875d2c in ut::detail::malloc(unsigned long) /home/ldonoso/src/release-8.0.29-20/storage/innobase/include/detail/ut/allocator_traits.h:71
    #2 0x55f098875db5 in ut::detail::Alloc_fn::malloc(unsigned long) /home/ldonoso/src/release-8.0.29-20/storage/innobase/include/detail/ut/allocator_traits.h:88
    #3 0x55f0988aa4b9 in void* ut::detail::Alloc_fn::alloc<false>(unsigned long) /home/ldonoso/src/release-8.0.29-20/storage/innobase/include/detail/ut/allocator_traits.h:97
    #4 0x55f09889b7a3 in void* ut::detail::Alloc_pfs::alloc<false>(unsigned long, unsigned int) /home/ldonoso/src/release-8.0.29-20/storage/innobase/include/detail/ut/alloc.h:275
    #5 0x55f09889bb9a in std::enable_if<ut::detail::Alloc_pfs::is_pfs_instrumented_v, void*>::type ut::detail::Alloc_<ut::detail::Alloc_pfs>::alloc<false, ut::detail::Alloc_pfs>(unsigned long, unsigned int) /home/ldonoso/src/release-8.0.29-20/storage/innobase/include/detail/ut/alloc.h:438
    #6 0x55f0988767dd in ut::malloc_withkey(ut::PSI_memory_key_t, unsigned long) /home/ldonoso/src/release-8.0.29-20/storage/innobase/include/ut0new.h:604
    #7 0x55f09937dd3c in rec_copy_prefix_to_buf_old /home/ldonoso/src/release-8.0.29-20/storage/innobase/rem/rem0rec.cc:1206
    #8 0x55f09937dfd3 in rec_copy_prefix_to_buf(unsigned char const*, dict_index_t const*, unsigned long, unsigned char**, unsigned long*) /home/ldonoso/src/release-8.0.29-20/storage/innobase/rem/rem0rec.cc:1233
    #9 0x55f098ae0ae3 in dict_index_copy_rec_order_prefix(dict_index_t const*, unsigned char const*, unsigned long*, unsigned char**, unsigned long*) /home/ldonoso/src/release-8.0.29-20/storage/innobase/dict/dict0dict.cc:3764
    #10 0x55f098c3d0ba in btr_pcur_t::store_position(mtr_t*) /home/ldonoso/src/release-8.0.29-20/storage/innobase/btr/btr0pcur.cc:141
    #11 0x55f098c027b6 in dict_getnext_system_low /home/ldonoso/src/release-8.0.29-20/storage/innobase/dict/dict0load.cc:256
    #12 0x55f098c02933 in dict_getnext_system(btr_pcur_t*, mtr_t*) /home/ldonoso/src/release-8.0.29-20/storage/innobase/dict/dict0load.cc:298
    #13 0x55f098c0c05b in dict_check_sys_tables /home/ldonoso/src/release-8.0.29-20/storage/innobase/dict/dict0load.cc:1573
    #14 0x55f098c1770d in dict_load_tablespaces_for_upgrade() /home/ldonoso/src/release-8.0.29-20/storage/innobase/dict/dict0load.cc:3233
    #15 0x55f0987e9ed1 in innobase_init_files /home/ldonoso/src/release-8.0.29-20/storage/innobase/handler/ha_innodb.cc:6072
    #16 0x55f098819ed3 in innobase_ddse_dict_init /home/ldonoso/src/release-8.0.29-20/storage/innobase/handler/ha_innodb.cc:13985
    #17 0x55f097fa5c10 in dd::bootstrap::DDSE_dict_init(THD*, dict_init_mode_t, unsigned int) /home/ldonoso/src/release-8.0.29-20/sql/dd/impl/bootstrap/bootstrapper.cc:742
    #18 0x55f0986696a6 in dd::upgrade_57::do_pre_checks_and_initialize_dd(THD*) /home/ldonoso/src/release-8.0.29-20/sql/dd/upgrade_57/upgrade.cc:922
    #19 0x55f09550e082 in handle_bootstrap /home/ldonoso/src/release-8.0.29-20/sql/bootstrap.cc:327
    #20 0x55f0997416e7 in pfs_spawn_thread /home/ldonoso/src/release-8.0.29-20/storage/perfschema/pfs.cc:2943
    #21 0x7f5f876a1608 in start_thread /build/glibc-SzIz7B/glibc-2.31/nptl/pthread_create.c:477

SUMMARY: AddressSanitizer: 70 byte(s) leaked in 1 allocation(s).
```

**Solution:**

The leak arises from the traversal of `pcur`. When the traversal is
exhausted, `pcur.close()` is called automatically and all `pcur`
resources are deallocated.

Percona adds some early returns to the traversal, hence sometimes the
traversal is not exhausted and `pcur.close()` is not called.

The solution is to call `pcur.close()` explicitly. `close()` is an
idempotent function, so it is not a bug if it is called several times as
a result of this change.
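
A simplified sketch of the fix; Cursor, Record, and should_skip() are stand-ins for InnoDB's btr_pcur_t, the system-table records, and Percona's early-return condition.

```
struct Record;

struct Cursor {
  void close();  // idempotent: safe to call more than once
};

Record *get_next(Cursor *);        // stand-in for dict_getnext_system()
bool should_skip(const Record *);  // hypothetical early-return condition

void scan(Cursor &pcur) {
  while (Record *rec = get_next(&pcur)) {
    if (should_skip(rec)) {
      pcur.close();  // the fix: release pcur's buffers explicitly
      return;        // instead of leaking them on the early return
    }
  }
  // Exhausted traversal: close() already ran automatically, and because
  // it is idempotent an extra explicit call here would be harmless.
}
```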
percona-ysorokin pushed a commit that referenced this pull request Sep 5, 2022
Remove duplicated NdbEventOperationImpl::m_eventId which is only used in
some printouts.

Change-Id: Id494e17e3a483a8d049e9aaeb9f41bd6d4ccd847
percona-ysorokin pushed a commit that referenced this pull request Sep 5, 2022
-- Patch #1: Persist secondary load information --

Problem:
We need a way of knowing which tables were loaded to HeatWave after
MySQL restarts due to a crash or a planned shutdown.

Solution:
Add a new "secondary_load" flag to the `options` column of mysql.tables.
This flag is toggled after a successful secondary load or unload. The
information about this flag is also reflected in
INFORMATION_SCHEMA.TABLES.CREATE_OPTIONS.

-- Patch #2 --

The second patch in this worklog triggers the table reload from InnoDB
after MySQL restart.

The recovery framework recognizes that the system restarted by checking
whether tables are present in the Global State. If there are no tables
present, the framework will access the Data Dictionary and find which
tables were loaded before the restart.

This patch introduces the "Data Dictionary Worker" - a MySQL service
recovery worker whose task is to query the INFORMATION_SCHEMA.TABLES
table from a separate thread and find all tables whose secondary_load
flag is set to 1.

All tables that were found in the Data Dictionary will be appended to
the list of tables that have to be reloaded by the framework from
InnoDB.

If an error occurs during restart recovery we will not mark the recovery
as failed. This is done because the types of failures that can occur
when the tables are reloaded after a restart are less critical compared
to previously existing recovery situations. Additionally, this code will
soon have to be adapted for the next worklog in this area so we are
proceeding with the simplest solution that makes sense.

A Global Context variable m_globalStateEmpty is added which indicates
whether the Global State should be recovered from an external source.

-- Patch #3 --

This patch adds the "rapid_reload_on_restart" system variable. This
variable is used to control whether tables should be reloaded after a
restart of mysqld or the HeatWave plugin. This variable is persistable
(i.e., SET PERSIST RAPID_RELOAD_ON_RESTART = TRUE/FALSE).

The default value of this variable is set to false.

The variable can be modified in OFF, IDLE, and SUSPENDED states.
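
A hedged sketch of how such a persistable boolean plugin variable is typically declared with the server's plugin API; the exact rapid_reload_on_restart definition may differ.

```
#include <mysql/plugin.h>

static bool reload_on_restart_value = false;

// MYSQL_SYSVAR_BOOL(name, var, opts, comment, check, update, default).
// PLUGIN_VAR_PERSIST_AS_READ_ONLY lets a SET PERSIST'ed value be applied
// at startup, matching the note in this commit's "Other changes" list.
static MYSQL_SYSVAR_BOOL(
    reload_on_restart,          // exposed as rapid_reload_on_restart
    reload_on_restart_value,
    PLUGIN_VAR_OPCMDARG | PLUGIN_VAR_PERSIST_AS_READ_ONLY,
    "Reload previously loaded tables after a restart of mysqld or the "
    "HeatWave plugin",
    nullptr,   // check function
    nullptr,   // update function
    false);    // default value: OFF
```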

-- Patch #4 --

This patch refactors the recovery code by removing all recovery-related
code from ha_rpd.cc and moving it to separate files:

  - ha_rpd_session_factory.h/cc:
  These files contain the MySQLAdminSessionFactory class, which is used
to create admin sessions in separate threads that can be used to issue
SQL queries.

  - ha_rpd_recovery.h/cc:
  These files contain the MySQLServiceRecoveryWorker,
MySQLServiceRecoveryJob and ObjectStoreRecoveryJob classes which were
previously defined in ha_rpd.cc. This file also contains a function that
creates the RecoveryWorkerFactory object. This object is passed to the
constructor of the Recovery Framework and is used to communicate with
the other section of the code located in rpdrecoveryfwk.h/cc.

This patch also renames rpdrecvryfwk to rpdrecoveryfwk for better
readability.

The include relationship between the files is shown on the following
diagram:

        rpdrecoveryfwk.h◄──────────────rpdrecoveryfwk.cc
            ▲    ▲
            │    │
            │    │
            │    └──────────────────────────┐
            │                               │
        ha_rpd_recovery.h◄─────────────ha_rpd_recovery.cc──┐
            ▲                               │           │
            │                               │           │
            │                               │           │
            │                               ▼           │
        ha_rpd.cc───────────────────────►ha_rpd.h       │
                                            ▲           │
                                            │           │
            ┌───────────────────────────────┘           │
            │                                           ▼
    ha_rpd_session_factory.cc──────►ha_rpd_session_factory.h

Other changes:
  - In agreement with Control Plane, the external Global State is now
  invalidated during recovery framework startup if:
    1) Recovery framework recognizes that it should load the Global
    State from an external source AND,
    2) rapid_reload_on_restart is set to OFF.

  - Addressed review comments for Patch #3, rapid_reload_on_restart is
  now also settable while plugin is ON.

  - Provide a single entry point for processing external Global State
  before starting the recovery framework loop.

  - Change when the Data Dictionary is read. Now we will no longer wait
  for the HeatWave nodes to connect before querying the Data Dictionary.
  We will query it when the recovery framework starts, before accepting
  any actions in the recovery loop.

  - Change the reload flow by inserting fake global state entries for
  tables that need to be reloaded instead of manually adding them to a
  list of tables scheduled for reload. This method will be used for the
  next phase where we will recover from Object Storage so both recovery
  methods will now follow the same flow.

  - Update secondary_load_dd_flag added in Patch #1.

  - Increase timeout in wait_for_server_bootup to 300s to account for
  long MySQL version upgrades.

  - Add reload_on_restart and reload_on_restart_dbg tests to the rapid
  suite.

  - Add PLUGIN_VAR_PERSIST_AS_READ_ONLY flag to "rapid_net_orma_port"
  and "rapid_reload_on_restart" definitions, enabling their
  initialization from persisted values along with "rapid_bootstrap" when
  it is persisted as ON.

  - Fix numerous clang-tidy warnings in recovery code.

  - Prevent suspended_basic and secondary_load_dd_flag tests to run on
  ASAN builds due to an existing issue when reinstalling the RAPID
  plugin.

-- Bug#33752387 --

Problem:
A shutdown of MySQL causes a crash in queries fired by DD worker.

Solution:
Prevent MySQL from killing DD worker's queries by instantiating a
DD_kill_immunizer before the queries are fired.

-- Patch #5 --

Problem:
A table can be loaded before the DD Worker queries the Data Dictionary.
This means that table will be wrongly processed as part of the external
global state.

Solution:
If the table is present in the current in-memory global state we will
not consider it as part of the external global state and we will not
process it by the recovery framework.

-- Bug#34197659 --

Problem:
If a table reload after restart causes OOM the cluster will go into
RECOVERYFAILED state.

Solution:
Recognize when the tables are being reloaded after restart and do not
move the cluster into RECOVERYFAILED. In that case only the current
reload will fail and the reload for other tables will be attempted.

Change-Id: Ic0c2a763bc338ea1ae6a7121ff3d55b456271bf0
percona-ysorokin pushed a commit that referenced this pull request Dec 6, 2022
Bug#34486254 - WL#14449: Mysqld crash - sig11 at rpd::ConstructQkrnExprCond
Bug#34381109 - Hypergraph offload Issue : LEFT JOIN test cases failing in i_subquery tests
Bug#34432230: Enabling testcase along with BM failures in order_by_limit_extended_mixed_varlen and scgts
Bug#34408615 - Assertion failure `args->ctx->m_lirid != kInvalidRelId' in ha_rpd_qkrn_expr.cc
Bug#34471424 - HYPERGRAPH HeatWave Visible fields not populated
Bug#34472083 - WL#14449: Mysqld crash - Assertion `extract_ctx.cmn_expr_or == nullptr' failed
Bug#34395166 - HYPERGRAPH BUG: QUERIES not offloading on heatwave in MTR SUITE
Bug#34056849 - WL#14449: Offload issue with dictionary encoding
Bug#34381126 - Hypergraph offload Issue : QCOMP test file non offload bug
Bug#34450394 - Hypergraph result mismatch
Bug#34472373 WL#14449: Mysqld crash - Assertion `args->ctx->m_lirid != kInvalidRelId' failed
Bug#34472354 WL#14449: Mysqld crash - sig11 at rpdrqce_check_const_cols_rpdopn
Bug#34472069 WL#14449: Mysqld crash - Assertion `n < size()' failed
Bug#34472058 WL#14449: Mysqld crash - sig11 at rpdrqc_construct_phyopt_bvfltr
Bug#34143535 - WL#14449: task formation error
Bug#34356273 - HYPERGRAPH BUG: CAST binary having issues with DICTIONARY ENCODING
Bug#34381303 - Hypergraph offload Issue : LIMIT DUAL not offloading
Bug#34356238 - HYPERGRAPH BUG: CAST DATE WITH DOUBLE_PRECISION
Bug#34448736 - Hypergraph Result mismatch:Result mismatch with user variables
BUG#34388727: Enabling testcases
Bug#34413698 - Hypergraph Union issues
Bug#34432241 - Hypergraph out of stack memory issue in rapid.qcomp_bugs_debug_notubsan_notasan
Bug#34369934 - Hypergraph Performance : TPCDS q93 qcomp issue -2
Bug#34399991 - HYPERGRAPH BUG: crash in cp_i_subquery_dict MTR file
Bug#34057893 - Fixing MTR timeout by reducing the partial JOIN search space
Bug#33321588 Hypergraph Result Mismatch : Cannot process QEXEC JSON document expected for each HeatWave node in query [no-close]
Bug#34395166 - HYPERGRAPH BUG: QUERIES not offloading on heatwave in MTR SUITE
Bug#34086457 - Hypergraph offload Issue : constant not marked correctly
Bug#34380519 BUG#33294870 BUG#34114292: Enabling testcases after these bug fixes.
BUG#34079278 : Partially enabling testcases for fixes cases.
BUG#33321588 : Fixing 'Cannot process QEXEC JSON document expected for each HeatWave node in query' Error
Bug#34360222 - HYPERGRAPH BUG: QUERIES WITH RANGE LIKE 1+1 NOT OFFLOADING WITH DICT ENCODING
Bug#34412319 - HyperGraph: sig 11 on bm mtr cp_blob_dict
Bug#34403562 - HyperGraph: Signal 6 while running rapid.cp_blob_dict testcase
Bug#34360341 - HYPERGRAPH BUG: QUERIES not offloading on heatwave with VARLEN ENCODING
Bug#34012291 - Hypergraph Offload Issue : Subquery is OOM instead of error ER_SUBQUERY_NO_1_ROW
Bug#34399868 - HYPERGRAPH BUG: Output mismatch in cp_i_index_merge
Bug#34289251 - WL#14449: post-join filters set as inner joins' extra predicates are not handled
Bug#34381354 - Hypergraph offload Issue : DATE COUNT not offloading
Bug#34119506 - Hypergraph Result Mismatch : Decimal precision issue
Bug#34399722 - HYPERGRAPH BUG: Output mismatch with mysql
Bug#34360278 - HYPERGRAPH BUG: QUERIES not offloading on heatwave
Bug#34369223 - HyperGraph: Offload failure when hypergraph is ON
Bug#34361863 - Impossible Condition Cases Failing with HyperGraph
Bug#34289797 - Hypergraph Optimizer: query projecting expression from inner side of outer join does not offload
Bug#34128728 - Hypergraph Crash : Due to ZERO ROWS
Bug#34066930 - Hypergraph Result Mismatch : Wrong result with Zero Rows
Bug#34078549 - Hypergraph Result Mismatch : Wrong result with ZERO ROWS Select_varlen test
Bug#33426211 - Hypergraph Offload issue : Due to the absence of few corner case optimizations
Bug#34086457 - Hypergraph offload Issue : constant not marked correctly
Bug#34299494 - Hypergraph : Disable commutative INNER JOIN
Bug#34299823 - Hypergraph Optimizer: Issue with projection set for partial plan
Bug#33380501 - WL#14449: expected error ER_SUBQUERY_NO_1_ROW but query succeeds
Bug#33811377 - WL#14449: SUBQUERY item in JOIN's extra predicates is not detected in partial plan
Bug#33410257 - Hypergraph Offload Issue :  Rapidside ENUM issue

* Add new file ha_rpd_estimate_ap.cc for costing AccessPath trees
  using the new Hypergraph Optimizer.

* Rework function CollectAllBaseTables() to not return any size
  information -- instead it can simply be computed by iterating
  over the map passed to it.

* Add member to Qkrn_context to store a pointer to the Hypergraph
  Optimizer object. When that is set we have a partial plan,
  otherwise it's the final plan.

* Add a couple of new timers for Hypergraph-based costing

* Replace all occurences of JOIN::having_for_explain with
  JOIN::having_cond, because the former one is not always populated
  correctly anymore.

* Ignore the SQL SELECT_BIG_RESULT hint as it does not have any meaning
  for RAPID.

* Set flags for handlerton::secondary_engine_flags

* Add new function Rapid_execution_context::IsQueryPushable() for
  partial plans.

* Currently, the patch contains a fix in the costing code which
  enables costing of any temporary table. This is ported forwards
  from the change for bug 34162247.

* Allow dumping partial plans by appending the overall partial
  plan ID for item dumps and QKRN dumps.

* Some const-ness fixes / improvements.

* Add code in ha_rpd_qkrn_ap.cc to extract the projection list
  for the root translate state of a partial plan.

* More fixes to partial plan projection set computation:
  In function ExtractGroupByHypergraph(), reuse function
  ExtractProjectionItemHypergraph() to extract sub-expressions correctly
  such that they match the current state_map.
  In function ExtractWindowHypergraph() take into account the current
  state_map, which was missing before and then for a base TABLE_SCAN we
  could pick up expressions from another base TABLE_SCAN, which led to
  offload errors.

* In HashJoinPullJoinConditionFromExtraPredicate(), remove a superfluous
  check whether Item_func::type() == Item::FUNC_ITEM.

* In TranslateHashJoin(), where we check the extra predicates also for
  inner joins, since those represent post join filters, initially we
  used UpdateItemListHashJoin(), which would project whole expressions
  from the post join filter, which leads to larger projection lists of
  the join and its children.
  For instance, for a query like
    SELECT t1.a, t2.b FROM t1 JOIN t2 ON ... WHERE t1.a > 5 OR t2.b > 10
  the WHERE condition ends up as post-join filter. Then, with the
  previous approach, t1 would project "t1.a" and "t1.a > 5" and t2 would
  project "t2.b" and "t2.b > 10" and then the post join filter would
  degrade into "qkrcol(t1.a > 5) != 0 OR qkrcol(t2.b > 10) != 0".
  Change to use UpdateItemList() which extracts only Item_fields, i.e.
  base columns, to match the behavior from the old optimizer.

* In ExtractProjectionItemHypergraph(), project all expressions for the
  final partial plan of a query block. This is necessary when e.g. a
  child query block is of the form
    SELECT 1, 2, 3, t1.a FROM t1 LIMIT 5
  Then, if we always ignore all constants in the TABLE_SCAN then when
  creating the TOPK node for the LIMIT we'll try to project the
  constants from there, which is not supported.

* Add TPC-H/DS perf job queries without STRAIGHT JOIN hints.

* Dump hypergraph performance profile timers to QEXEC_pp JSON dump.

* In Mtoq_table::GetCell() an earlier patch introduced a small
  optimization which was intended to skip an mrow_item in the case of
  aggregation Items (ITEM_SUM_FUNCs). The idea is that when one is
  <aggregate>(DISTINCT) and the other is just <aggregate>(), then
  they cannot be the same. Also, when the number of arguments differ,
  or when they are just different aggregates, then we don't need
  to call AreItemsEqual() after the special ITEM_SUM_FUNC code path.
  However, the "optimization" was wrong and skipped too many Items
  such that some were not found at all in the MtoQ table anymore.

* Add more state to the hypergraph optimizer to remove code from
  the HeatWave side. In particular:
  * To decide whether ORDER BY LIMIT OFFSET can be handled we only
    need to check a new flag.
  * Whether there are multiple base tables in a query block is also
    tracked through a hypergraph optimizer flag.
  * Add handling for hypergraph filters on top of derived tables and
    joins.

Improve speed of Mtoq_table::GetCell
========================================
 Before, we called Qkrn_context::GetQueryBlock() in each loop
 iteration over the cells for the given query block. That call,
 however, is already made once before the loop, so we can re-use
 the retrieved query block (see the sketch after this list).

 For DISTINCT aggregation Items (SUM_FUNC_ITEM) we have a special
 code path to compare them as equal when e.g.

 Also, improve the code of AreDistinctItemListEqual():
 * Use only the prefix increment operator.
 * Use the copy operator to populate vector "it", which reserves
   space up front and avoids multiple re-allocations.
 * Rename "it" to "first_copy" and "it1" to "iter".

Bug 34289251
============
Since the introduction of the Hypergraph optimizer, post-join filters
can be encoded as join extra predicates, which was not yet considered
in HeatWave's AccessPath translation. This can produce wrong
results, as in the bug's case, where the query is quite simple:

  SELECT * FROM t1 NATURAL JOIN t2 WHERE t1.b > t2.b;

Here, the post-join filter "t1.b > t2.b" is encoded as a join extra
predicate.

The fix is to "simply" also consider those for inner joins, not
only for outer joins and semi joins and ensure that an intermediate
qkrnts is created for the post-join filter. Care had to be taken
to also ensure that all of the predicates' sub-expressions are
added to both projected and required items for the join itself.

Additionally, added a few comments to function calls for constants
and added checks for the return values of calls to
UpdateItemListHashJoin().

Bugs 33380501, 33811377:
========================
* Reset Qkrn_context::m_leaf_table_count before counting the leaf
  tables

* In CountLeafTables(): The hypergraph optimizer counts derived
  tables and common table expressions as leaf tables in their
  parent query blocks, so do not traverse any further.

* Bug 33380501: patch by Kajori Banerjee. Adds function
  rpdrqcj_rejects_multiple_rows(), which recursively checks whether
  JOIN key expressions contain a qkrcol that has
  reject_multiple_rows_qkrcol == true (see the sketch below).
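
  A sketch of the recursive check, with hypothetical stand-ins for the
  qkrn expression and column types:

```
#include <vector>

// Hypothetical stand-ins for qkrn expression nodes and columns.
struct Qkrcol {
  bool reject_multiple_rows_qkrcol = false;
};

struct QkrnExpr {
  const Qkrcol *col = nullptr;       // set for column leaves
  std::vector<QkrnExpr *> children;  // set for operator nodes
};

// Sketch of rpdrqcj_rejects_multiple_rows(): recursively scan a JOIN
// key expression for a qkrcol flagged with reject_multiple_rows_qkrcol.
bool RejectsMultipleRows(const QkrnExpr &expr) {
  if (expr.col != nullptr && expr.col->reject_multiple_rows_qkrcol)
    return true;
  for (const QkrnExpr *child : expr.children)
    if (child != nullptr && RejectsMultipleRows(*child)) return true;
  return false;
}
```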

* Bug 33811377:
  Root cause: through IsQueryPushable we don't identify all cases
  where a subquery item is involved. In this case it is a join's
  extra predicate whose string representation is
    <in_optimizer>(8,<exists>(select #3))
  When we try to translate the expression it is inside an
  Item_cond_and, and since it is const_for_execution(), we call
  Item::is_null(), which executes the subquery and leads to a mysqld
  crash. The fix is to also check evaluate_during_optimization(),
  which identifies such a case. This check is added in several places
  where we call a function that potentially evaluates the Item
  (like is_null()). By that, we avoid interpreting the subquery item
  as a constant; instead we try to fully translate the subquery item,
  which then hits the default (unsupported) case in
  ConstructQkrnExprTreeSwitch(), by which we bail out of query
  translation. A sketch of the guard follows.
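
  A minimal sketch of the guard, using toy stand-ins for the Item API
  mentioned above (the real names exist in the server; these types are
  simplified):

```
// Hypothetical Item stand-in; the real MySQL Item API provides
// const_for_execution(), is_null() and an evaluate_during_optimization()
// check, as described above.
struct Item {
  bool const_for_execution_flag = false;
  bool contains_subquery = false;
  bool const_for_execution() const { return const_for_execution_flag; }
  // In the real server this may execute a subquery; here it is inert.
  bool is_null() const { return false; }
};

// Only safe to evaluate during optimization when no subquery would run.
bool EvaluateDuringOptimization(const Item &item) {
  return !item.contains_subquery;
}

bool TreatAsNullConstant(const Item &item) {
  // Guard: without the EvaluateDuringOptimization() check, is_null()
  // could execute the subquery at translation time and crash mysqld.
  return item.const_for_execution() && EvaluateDuringOptimization(item) &&
         item.is_null();
}
```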

* Avoid assertions in rapidjson due to calling IsQueryPushable()
  for every subplan. All calls to PrintOffloadTrace passed
    abort_query = true
  which, however, is wrong for partial plans, because we may get a
  supported one later.

* Split cp_join_prepart into dict and varlen tests.

* Make the Extract***Hypergraph functions for partial plans only
  return bool (false=success, true=error).

* Fix extraction of required items of GROUP BY keys and aggregation
  expressions for partial plans. This must be done similarly to
  AccessPath-style extraction using CompileItem. Therefore,
  refactored function ExtractRequiredItemsForGroupBy() to use
  the same functionality for partial plan extraction, too.

* In RapidEstimateQueryCostsAP(), when we catch a RapidException,
  for now always print an offload trace and dump the exception's
  type and string message to the screen.

* Pick up more test changes from the original branch.

* Extract projection expressions from all WINDOWs' partition by
  and order by lists.

* Re-introduce handling of aggregate items in UpdateItemList, which
  was removed during code coverage-related clean-up for worklog 14344
  (AP uptake).

* We were not aware that for semi joins we need to always project
  an expression from the inner side due to some QCOMP issue.
  Filed bug 34252350 to track this issue.

* Fix issue with correlated semi joins where an OR condition was
  attached to the semi join as an additional condition
  (extra predicate) but was ignored, because any non-Item::FUNC_ITEM
  was ignored when inspecting the extra predicates. The fix is to
  also handle Item::COND_ITEMs.

* In the AP cost estimation code, print the QCOMP exception to the
  console if one occurs (for debugging; will be removed later).

* In TranslateDerivedTable(): When a derived table MATERIALIZE
  AccessPath is the root node and we're translating a partial plan,
  then when re-initializing the Qkrn_context for the parent query block
  (which contains the derived table's MATERIALIZE) we need to call
    InitializeQkrnContextPartialPlan()
  instead of
    InitializeQkrnContext().

* Fix issues when the WITH ROLLUP modifier is present and when
  extracting expressions for the root AccessPath for partial plans.

* The Qkrn_context flags m_has_gby, m_has_ftag, m_has_windowfunc,
  and m_has_oby were used wrongly: they only indicate whether the
  current query block has one of those, but they don't indicate
  whether corresponding qkrn nodes were created.
* One issue with ENUM columns and WITH ROLLUP: when the ENUM column
  is inside an Item_rollup_group_item then do
* Remove argument "is_outer_query_block" from TransformQkrn().
  It is not needed as we can be sure that transformation is only
  done once per partial plan or query fragment.

Bug 33410257 - Hypergraph Offload Issue :  Rapidside ENUM issue
===============================================================
The algorithm for detecting whether there are string operations on an
enum column in ContainsEnumOpn() was incomplete. It needs to keep
track of the parent operations, similar to how
ContainsUnsupportedCastEnumToChar() works; a sketch follows.
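
A sketch of the parent-tracking recursion, with a toy node type
instead of the real Item tree:

```
#include <vector>

// Toy expression node; the real code walks MySQL Item trees.
struct Node {
  enum Kind { ENUM_COLUMN, OTHER_COLUMN, STRING_FUNC, OTHER_FUNC } kind;
  std::vector<Node *> children;
};

// Sketch of the fixed ContainsEnumOpn() idea: carry the parent
// operation down the recursion so an ENUM column directly under a
// string operation is detected wherever it appears in the tree.
bool ContainsEnumOpn(const Node &node, const Node *parent) {
  if (node.kind == Node::ENUM_COLUMN && parent != nullptr &&
      parent->kind == Node::STRING_FUNC)
    return true;
  for (const Node *child : node.children)
    if (child != nullptr && ContainsEnumOpn(*child, &node)) return true;
  return false;
}
```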

Bug 34471424 - HYPERGRAPH HeatWave Visible fields not populated
===============================================================
For a few partial plans in the bug's query, which contain a common
table expression, the hidden bitmap is not populated properly, which
leads to an empty initial projection set for the first AccessPath
inside the CTE.

Fix is to ensure for partial plans that the projection items of the
CTE are all marked as not hidden in the hidden bitmap.

Note that there *may* be a different root cause which is specific
to some rewrite where we have a CTE inside another derived table.

Bug 34472083:
=============
The code for extracting potential join predicates from parent
FILTER AccessPaths was very strict in asserting that OR
conditionals are not nested. This can, however, be the case, and
we should handle it more gracefully. For directly-nested ORs we
can simply pretend that those ORs had already been merged by the
resolver/optimizer and proceed. For more complex nested OR/AND
filter or join predicates we now just completely skip trying to
extract any predicates (see the sketch below).
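
A sketch of the idea, with a toy condition node instead of the real
Item_cond classes:

```
#include <vector>

// Toy condition node (the real code inspects Item_cond_or/Item_cond_and).
struct Cond {
  enum Kind { OR, AND, LEAF } kind;
  std::vector<Cond *> args;
};

// Treat directly-nested ORs as if the resolver had already merged them,
// i.e. flatten OR(OR(a, b), c) into one argument list. For anything more
// complex (an AND nested under the OR), give up gracefully instead of
// asserting. Returns false when predicate extraction should be skipped.
bool FlattenOrArgs(const Cond &cond, std::vector<const Cond *> *out) {
  for (const Cond *arg : cond.args) {
    if (arg->kind == Cond::OR) {
      if (!FlattenOrArgs(*arg, out)) return false;  // nested OR: merge
    } else if (arg->kind == Cond::AND) {
      return false;  // complex nesting: skip extraction entirely
    } else {
      out->push_back(arg);  // plain leaf predicate
    }
  }
  return true;
}
```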

Added test wl14449_bugs with dict and varlen variants.

Bug 34395166:
=============
* scgts: The query had multiple partial plans failing because the
  hidden_map for HeatWaveVisibleFields adapter was not correct for
  some Common Table Expressions. As a quick fix, this is now
  corrected on the HeatWave side and in the meantime will be
  discussed with the MySQL optimizer team whether this is actually
  a bug on their side. The issue seems to be the following: The
  affected SCGTS query has quite a deep nesting of CTEs, derived
  tables and UNIONs, and one of the CTEs is actually merged into
  another one.
  In the query, CTE
    snapshot_inventory_item_sub3
  is merged into CTE
    snapshot_inventory_item_sub4.
  While more CTEs may have been merged, we use only this example
  for illustration. Then, for computing the hidden_map, we check
  for each CTE instance its parent query block and search for all
  occurrences of Item_fields from that CTE. For that search, we use,
  amongst others, Query_block::top_join_list.
  Now in this case, however, the top_join_list only contains the
  tables from the query block BEFORE the other CTE (and its tables
  and joins) was merged into that query block. In this case, the
  list only contains tables
    * po_octg_line
    * sub3b
  while it should contain tables
    * po_octg_line
    * snapshot_inventory_item_sub2 AS sub2
    * state_prov
    * invtry_item
    * pt_bol_date
  Those tables can be found in Query_block::leaf_tables, but not in
  Query_block::top_join_list.
  The (temporary?) fix is to check both Query_block::leaf_tables
  and Query_block::top_join_list.
* astro: The query is successfully offloaded again, re-enabled the
  test case and re-recorded the result file.

Other Changes:
==============
* ha_rpd_estimate_ap.cc: Correctly set qcomp_ret upon errors during
                         QCOMP.
* cloud_variables.result: Update due to latest bm pre-merge job.
* qcomp_bugs_debug_notubsan_notasan.result: Update due to latest bm
                                            pre-merge job.

Bug 34056849 - WL#14449: Offload issue with dictionary encoding
================================================================
1. The test cases in bug_poc_dict are rightfully not offloaded given
   MySQL's type input. On mysql-8.0 there are implicit casts
   (e.g. string to double) which enable these test cases to offload
   on that branch, but MySQL behavior is not consistent across all
   comparison functions and BETWEEN.
   Limitations on the HyperGraph branch are consistent (see the newly
   added test cases) and are in agreement with known HeatWave
   limitations with dictionary encoding.

2. Marked the test cases in bushy_replaylog_dict as offloaded on the
   latest branch.

3. Test cases in join_outer_dict are rightfully not offloaded given
   the AP plan: the join condition requires comparison of different
   dictionary encoded columns (coming from a different base column).
   With natural join the plan is different: the join condition is
   pushed into both tables and the join is a cartesian product - hence
   there's no issue with comparing dict-encoded columns.
   On 8.0 both plans contain a cartesian product with two pushed-down
   filters - hence the offload issue does not exist.

Bug 34381126 :
================
Recording the test cases since the bug is not reproducible anymore.

Bug 34450394 :
==================
The mismatch was due to INSERT INTO + SELECT.
Changing the test case to use ORDER BY in the SELECT
ensures the output is deterministic.

Bug#34472373
===================
Partial plan 4:
======================

-> Inner hash join prefix over( SUBQUERY2_t1_SUBQUERY2_t2):
   * Projection set = []
 -> FILTER prefix:
       * table_map = 0x0001
       * Original filter =
            (unhex('') = concat_ws('0',SUBQUERY2_t1.col_varchar_key
       * Projection set = [unhex('')]
       * Required items =
            [unhex(''), concat_ws('0',SUBQUERY2_t1.col_varchar_key)
       * DummyFields = [(none)(=0), (none)(=0)]
        -> TABLE_SCAN prefix on SUBQUERY2_t1:
           * Actual filter =
          (unhex('') = concat_ws('0',SUBQUERY2_t1.col_varchar_key)
           * Projection set =
          [unhex(''), SUBQUERY2_t1.col_int_key]

Problem :
===================
ExtractRequiredItemsForFilter adds the constant to the
required items.

Bug#34472354
===================
Same fix as above resolved the issue.

Bug#34472069 and Bug#34472058  : fixed in the latest branch

sum_distinct-big is timing out in Job 1094368.

Bug#34143535 - WL#14449: task formation error
==============================================
1) The DISTINCT inside an inner query block of a partial plan was
not applied. Updated the function ReadyToHandleDistinctAndOrderBy.

2) Problem :
Earlier all the partial plans were dumped irrespective of any
system variable or flag.

Solution :
Dump hypergraph partial plan only when system variable
qkrn_print_translate_states is set to true.

Bug 34356238
==============
Closed since the diff was because we only match MySQL up to 5
decimal places.

Bug 34448736 :
===============
Closed since the diff matches with MySQL output.

Bug 34413698 - Hypergraph Union issues
======================================
Problem: Union cases with the hypergraph optimizer failed due to
missing proper projection in a special case of derived tables.

Solution: Update the projection set population such that, in this
special case of derived table + union, we appropriately
populate the projection set.

Bug#34432241:
==============
An unnecessary call to ItemToString in IsConstValueItem was causing
the issue. Removed the redundant call to ItemToString.

Bug#34369934
============
The join pattern changed due to the change in the order of the join
keys with the hypergraph optimizer.

Bug 34399991 - HYPERGRAPH BUG: crash in cp_i_subquery_dict MTR file
===================================================================
Ensure filter items do not have subqueries in them.

Bug#34395166 -P6
==================
Query :
SELECT * FROM t1 WHERE c_id=@min_cid OR c_id=@max_cid;

The query has ZERO rows, but it was not getting offloaded because of
IsQueryPushable.

Solution :
Do not check IsQueryPushable for ZERO_ROWS, FAKE_SINGLE_ROW and
ZERO_ROWS_AGGREGATED (see the sketch below).
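
A sketch of the skip, with a toy AccessPath stand-in whose type names
mirror the ones above (not the actual server structures):

```
// Hypothetical AccessPath stand-in.
struct AccessPath {
  enum Type {
    ZERO_ROWS,
    ZERO_ROWS_AGGREGATED,
    FAKE_SINGLE_ROW,
    TABLE_SCAN,
    HASH_JOIN
  } type;
};

// Placeholder for the real pushability check.
bool IsQueryPushable(const AccessPath &path) {
  return path.type == AccessPath::TABLE_SCAN;  // stand-in logic only
}

bool CheckPushable(const AccessPath &path) {
  // Constant-row paths carry no table data to offload; do not reject
  // the query just because IsQueryPushable() cannot classify them.
  switch (path.type) {
    case AccessPath::ZERO_ROWS:
    case AccessPath::ZERO_ROWS_AGGREGATED:
    case AccessPath::FAKE_SINGLE_ROW:
      return true;
    default:
      return IsQueryPushable(path);
  }
}
```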

Bug#34395166 - P5
===================
Query :

SELECT *
FROM t3
WHERE col_varchar_10_latin1_key IN (
   SELECT alias1.col_varchar_10_latin1_key
   FROM t1 AS alias1
   LEFT JOIN t1 AS alias2
   JOIN t2 AS alias3
   ON alias2.col_varchar_10_latin1_key
   ON alias1.col_varchar_1024_utf8_key
   WHERE alias1.pk AND alias1.pk < 3 OR alias1.pk AND alias3.pk);

The BV filter has the condition alias1.pk<>0 AND alias1.pk < 3.

The hash join node projects the expression (alias1.pk < 3) instead
of the individual columns. This later creates a problem during
qkrn transformation of the BV filter as one child of the condition
(AND) is a column instead of an expression.

Bug#34395166 - P2
====================
Query:
SELECT RANK() OVER (ORDER BY AVG(id)) FROM t1;

The item_map of (RANK() OVER (ORDER BY AVG(id))) is 0.
Hence it was not projected from the window function of
partial plan 4. As a result the window node projected the dummy
column, leading to an offload issue.

BUG#33321588 :
==============
- The root cause for this issue is that the ordering of the query
  result is not in the expected format.
- HeatWave doesn't ensure ordering without an explicit ORDER BY clause.
- An ORDER BY clause is added at the required place.

Bug 34086457 - Hypergraph offload Issue : constant not marked correctly
=======================================================================
Problem: coalesce(col1, col2) is not marked as Item type ITEM_CACHE,
which leads to a false value being returned from isConstValue.

Solution: check for functions with string constants in isConstValue.

Bug#34399868 - HYPERGRAPH BUG: Output mismatch in cp_i_index_merge
=================================================================
This is actually *NOT* a HyperGraph bug!

There are 2 test cases which result in a mismatch on
mysql-8.0 as well. A bug has been opened for them.

This did not show up before because cp_i_index_merge.test relies
only on randomized data. That also risks unstable MTR runs.

A temporary solution here is to just use EXPLAIN for all test cases
that might be unstable - which is what this patch does.

In the meantime the underlying bug 34423734 needs to be fixed on
8.0 and those concrete test cases (without randomness) will be
appended to this test file.

Bug#34399722 - HYPERGRAPH BUG: Output mismatch with mysql
==============================================================
ROW_NUMBER() is a window (analytic) function
that assigns a sequential number to each row in the result set,
beginning with one. Since there is no ordering
mentioned in the window function, the row numbers
might be assigned in a different order.

Bug#34361863 - Impossible Condition Cases Failing with HyperGraph
==================================================================
For the hypergraph optimizer, Rapid_execution_context fields like
m_subquery_type and m_best_outer_plan_num are not updated during
join order enumeration. Hence checking them in HasOnlyInvalidPlans
is not relevant.

BUG#34289797 :
========================
Not reproducible, thus enabling the relevant test case
and re-recording the test file. Additionally, more queries offload
with hypergraph; offload expectations are changed correspondingly in
the bushy_bugs_varlen and bushy_bugs_dict test files.

BUG 34128728, Bug 34066930, Bug 34078549
Fix for Zero rows induced problems with Hypergraph Optimizer
==================================================================
Problem: The Hypergraph Optimizer may sometimes propose JOIN plans
with a Zero Rows AP as the inner/outer child. In this case, we handle
it by inserting a Filter and ZeroRowAggregate replacing the ZeroRow AP.
However, PopulateProjectionSet does not pick up the correct
state map of the tables under the new Filter AP. TranslateHashJoin
does special handling for outer joins and zero rows, but this was
now being done for inner joins as well, which is incorrect.

Solution: We introduce an m_zero_row_state_map member in the
Translate_State::filter struct, which is used to resolve the state
map when we replace the Zero Row AP with the Filter AP. In subsequent
calls to GetUsedTables() from PopulateProjectionSet() for the
new Filter AP, we are able to resolve the correct state_map,
thus handling the projection correctly.
Secondly, in TranslateHashJoin(), the special processing of zero
rows in the outer join case has an additional check to ensure it is
invoked only when an outer join is present.

Bug#34299494 - Hypergraph : Disable commutative INNER JOIN
============================================================
Say the query is
SELECT * FROM t1, t2 WHERE t1.a=t2.a;

At present the hypergraph optimizer explores both join orders
(t1 t2) and (t2 t1). Since both are the same from the HeatWave
perspective, it does not make sense to explore (t2 t1) when
(t1 t2) has already been explored (see the sketch below).
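
A sketch of one way to implement this, assuming table sets are
represented as bitmaps as in the hypergraph optimizer; the class and
names are hypothetical:

```
#include <cstdint>
#include <set>
#include <utility>

using TableMap = std::uint64_t;

// Remember each explored inner-join pair by an order-independent key,
// so (t2, t1) is skipped once (t1, t2) has been costed.
class JoinOrderDeduper {
 public:
  // Returns true the first time this unordered {left, right} pair is seen.
  bool ShouldExplore(TableMap left, TableMap right) {
    auto key = left < right ? std::make_pair(left, right)
                            : std::make_pair(right, left);
    return seen_.insert(key).second;
  }

 private:
  std::set<std::pair<TableMap, TableMap>> seen_;
};
```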

Bug#34299823 - Hypergraph Optimizer:
Issue with projection set for partial plan
==============================================================
* In ExtractProjectionItemHypergraph(), also project expressions
  which do not reference any tables. This fixes queries where
  partial plans would otherwise have no SELECT list at all, for
  instance
    SELECT isnull(gen_rnd_ssn()) FROM ...
* In TranslateHashJoin(), for extra predicates which are used as
  post-join filters, use UpdateItemListHashJoin() instead of
  UpdateItemList, because it updates local variables
  "has_from_inner" and "has_from_outer" which may be important for
  dummy projection.
  An example query, from data_masking.test:267:
    SELECT isnull(gen_rnd_ssn()), gen_rnd_pan(gen_range(1,1))
    FROM t1 JOIN t1 AS t2 ON t1.e > t2.e
    GROUP BY gen_rnd_ssn();
  For that query the GROUP BY key and the SELECT list do not
  contain any expressions over base table columns. Hence, when
  reaching a plan with the AGGREGATE AccessPath, we don't have
  any required items for it, but it needs at least one required
  item for dummy projection (see TranslateGroupBy() for
  step case 1).

Further fixes:
* Correctly handle the post-join filter retrieved through
  GetHyperGraphFilterCond(). First, all sub-expressions must be
  projected from the join node itself and then the post-join filter
  must be added to the intermediate qkrnts' filter, too.
* Add function DumpItemToString() which simply prints the result of
  ItemToString() for the given Item to the console. This is just a
  small debug utility function because my gdb does not print that
  output anymore.

Bug#34490127 - Always enforce Hypergraph Optimizer for HeatWave
Bug#34303379 - mysqld got signal 6 in rpd::ExtractCommonExpression at ha_rpd_qkrn_ap.cc
Bug#34242831 - [Sonny] Cost estimation: AttachToCommonParent: Expected a join node
Bug#34151831 - Query offload fails at ExtractItemFromDerivedTable
Bug#33659812 - LEFT JOIN OFFLOAD ISSUE: AttachToCommonParent: Expected a join node

Bug 34490127:
=============
* Set the optimizer switch for using the hypergraph optimizer in the
  constructor of Rapid_execution_context, to always use it for HeatWave
  going forward.
* Added an offload check in RapidOptimize() that the Hypergraph
  Optimizer is indeed used.
* To take effect we had to move the code in sql_select.cc (function
  Sql_cmd_dml::prepare) which updates lex->using_hypergraph_optimizer to
  after the call to open_tables_for_query() such that the secondary
  engine's prepare hook is called before checking the optimizer switch,
  otherwise the first change above would stay ineffective.
* Removed adding the cmake flag WITH_HYPERGRAPH_OPTIMIZER from file
  _helper.sh again such that we can test the current behavior that the
  hypergraph optimizer is disabled by default for other engines than
  HeatWave in PERF builds. We circumvent this behavior for HeatWave by
  setting the optimizer switch programmatically.
* Remove the `--hypergraph` flag again from r_mtr.json.
* Update test files to not source not_hypergraph.inc or
  have_hypergraph.inc, because they test for hypergraph being enabled or
  not by setting the optimizer_switch, which is no longer possible on
  PERF builds and no longer needed with the above changes.
* partition_dict.test and partition_varlen.test : Added missing
  enforcement of secondary engine.
* Revert result for test autodb_hcs_console_queries
* Re-enable tests from disabled.def : rapid.autodb_hcs_console_queries
  and rapid.qcomp_cost_model_evaluation_tpch.
  Re-record result for qcomp_cost_model_evaluation_tpch.result.
* Re-record affected test results.

Bugs 34303379, 34242831, 34151831, 33659812:
============================================
* Added test cases
* Added a separate sonnys test file for all bugs which use the
  respective schema.

WL#14449 - Query_term support

* Implement remaining changes for proper Query_term support.
* In the qkropn dumping code which tracks the use of tables and base
  table columns, checks for whether RapidSchema is nullptr were missing.
  This led to crashes in some DEBUG MTR tests.
* Stabilize and improve several tests

WL#14449 - Handle FILTER AccessPaths with FALSE predicate

Until Bug 34494609 is properly fixed, add functionality to replace
FILTER AccessPaths that have only a FALSE predicate by ZERO_ROWS or
ZERO_ROWS_AGGREGATED as appropriate (see the sketch below). The
latter is used when ZERO_ROWS can be propagated all the way up to an
aggregation AccessPath in the current Query_block.
The ZERO_ROWS will be propagated as far up the current query block as
possible. Once this bug is fixed on the MySQL side, this
functionality can be removed again.
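
A minimal sketch of the rewrite, with simplified stand-in types (not
the actual AccessPath structures, which have more than one child):

```
// Toy AccessPath rewrite: a FILTER whose predicate is constant FALSE
// becomes ZERO_ROWS, or ZERO_ROWS_AGGREGATED when an aggregation sits
// above it in the same query block.
struct AccessPath {
  enum Type { FILTER, ZERO_ROWS, ZERO_ROWS_AGGREGATED, AGGREGATE, OTHER } type;
  AccessPath *child = nullptr;
  bool filter_is_const_false = false;
};

void ReplaceFalseFilters(AccessPath *path, bool under_aggregate) {
  if (path == nullptr) return;
  if (path->type == AccessPath::FILTER && path->filter_is_const_false) {
    path->type = under_aggregate ? AccessPath::ZERO_ROWS_AGGREGATED
                                 : AccessPath::ZERO_ROWS;
    path->child = nullptr;  // the subtree below cannot produce rows
    return;
  }
  ReplaceFalseFilters(path->child,
                      under_aggregate || path->type == AccessPath::AGGREGATE);
}
```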

Change-Id: I77e6b7a75bb9071d4ad4fbc22b445c4bd51a82c7
percona-ysorokin pushed a commit that referenced this pull request Apr 21, 2023
Upstream commit ID : fb-mysql-5.6.35/25b42b5d4365b6d2deb6bf40da9e776cd2e56698
PS-7780 : Merge fb-prod202101

Summary:
Today in `SELECT count(*)` MyRocks would still decode every single column due to this check, despite the readset being empty:

```
 // bitmap is cleared on index merge, but it still needs to decode columns
    bool field_requested =
        decode_all_fields || m_verify_row_debug_checksums ||
        bitmap_is_set(field_map, m_table->field[i]->field_index);
```

As a result MyRocks is significantly slower than InnoDB in this particular scenario.

Turns out in index merge, when it tries to reset, it calls ha_index_init with an empty column_bitmap, so our field decoders didn't know they needed to decode anything, and the entire query would return nothing. This is discussed in [this commit](facebook/mysql-5.6@70f2bcd), and [issue 624](facebook/mysql-5.6#624) and [PR 626](facebook/mysql-5.6#626). So the workaround we had at that time was to simply treat an empty map as implicitly meaning everything, and the side effect is a massively slowed-down count(*).

We have a few options to address this:
1. Fix index merge optimizer - looking at the code in QUICK_RANGE_SELECT::init_ror_merged_scan, it actually fixes up the column_bitmap properly, but after init/reset, so the fix would simply be moving the bitmap set code up. For secondary keys, prepare_for_position will automatically call `mark_columns_used_by_index_no_reset(s->primary_key, read_set)` if HA_PRIMARY_KEY_REQUIRED_FOR_POSITION is set (true for both InnoDB and MyRocks), so we would know correctly that we need to unpack PK when walking SK during index merge.
2. Overriding `column_bitmaps_signal` and setup decoders whenever the bitmap changes - however this doesn't work by itself. Because no storage engine today actually use handler::column_bitmaps_signal this path haven't been tested properly in index merge. In this case, QUICK_RANGE_SELECT::init_ror_merged_scan should call set_column_bitmaps_no_signal to avoid resetting the correct read/write set of head since head is used as first handler (reuses_handler=true) and subsequent place holders for read/write set updates (reuse_handler=false).
3. Follow InnoDB's solution - InnoDB delays it actually initialize its template again in index_read for the 2nd time (relying on `prebuilt->sql_stat_start`), and during index_read `QUICK_RANGE_SELECT::column_bitmap` is already fixed up and the table read/write set is switched to it, so the new template would be built correctly.

In order to make it easier to maintain and port, after discussing with Manuel, I'm going with a simplified version of #3 that delays decoder creation until the first read operation (index_*, rnd_*, range_read_*, multi_range_read_*), setting the delay flag in index_init / rnd_init / multi_range_read_init. A rough sketch of the idea follows.
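
A rough sketch with hypothetical names (not the actual MyRocks ha_rocksdb / Rdb_converter classes):

```
// index_init()/rnd_init() only set a flag; the decoders are built from
// the table read_set on the first actual read, by which time index
// merge has installed the correct column bitmap.
class RowConverter {
 public:
  void SetupFieldDecoders() { decoders_ready_ = true; }
  bool decoders_ready() const { return decoders_ready_; }

 private:
  bool decoders_ready_ = false;
};

class HandlerSketch {
 public:
  int index_init() {
    need_decoder_setup_ = true;  // delay: read_set may still change
    return 0;
  }
  int index_read() {
    if (need_decoder_setup_) {  // first read: bitmap is now correct
      converter_.SetupFieldDecoders();
      need_decoder_setup_ = false;
    }
    return 0;  // ... actual key lookup would go here
  }

 private:
  RowConverter converter_;
  bool need_decoder_setup_ = false;
};
```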

Also, I ran into a bug with truncation_partition where Rdb_converter's tbl_def is stale (we only update ha_rocksdb::m_tbl_def), but it is fine because it is not being used after table open. But my change moves the lookup_bitmap initialization into Rdb_converter which takes a dependency on Rdb_converter::m_tbl_def so now we need to reset it properly.

Reviewed By: lth

Differential Revision: D25521518

fbshipit-source-id: a46ed3e71fa
percona-ysorokin pushed a commit that referenced this pull request Sep 22, 2023
Some minor refactoring of function find_item_in_list() before the fix.

- Code is generally aligned with coding standard.

- Error handling is separated out; it is now false for success and
  true for error.

- The found item is now an output argument, and a null pointer means
  the item was not found (along with the other two out arguments).

- The report_error argument is removed since it was always used as
  REPORT_EXCEPT_NOT_FOUND.

- A local variable "find_ident" is introduced, since it better
  represents that we are searching for a column reference than having
  separate field_name, table_name and db_name variables.

Item_field::replace_with_derived_expr_ref()

- Redundant tests were removed.

Function resolve_ref_in_select_and_group() has also been changed so
that success/error is now returned as false/true, and the found item
is an out argument.

Function Item_field::fix_fields()

- The value of thd->lex->current_query_block() is cached in a local
  variable.

- Since single resolving was introduced, a test for "field" equal to
  nullptr was redundant and could be eliminated, along with the indented
  code block that followed.

- A code block for checking bitmaps if the above test was false could
  also be removed.

Change-Id: I3cd4bd6a23dd07faff773bdf11940bcfd847c903
percona-ysorokin pushed a commit that referenced this pull request Sep 22, 2023
Two problems were identified for this bug. The first is seen by looking
at the reduced query:

select subq_1.c1 as c1
from (select subq_0.c0 as c0,
             subq_0.c0 as c1,
             90 as c2,
             subq_0.c1 as c3
      from (select (select v2 from table1) as c0,
                   ref_0.v4 as c1
            from table1 as ref_0
           ) as subq_0
      ) as subq_1
where EXISTS (select subq_1.c0 as c2,
                     case
                     when EXISTS (select (select v0 from table1) as c1
                                          from table1 as ref_8
                                          where EXISTS (select subq_1.c2 as c7
                                                        from table1 as ref_9
                                                       )
                                         )
                     then subq_1.c3
                     end as c5
              from table1 as ref_7);

In the innermost EXISTS predicate, a column subq_1.c2 is looked up.
It is erroneously found as the column subq_1.c0 with alias c2 in the
query block of the outermost EXISTS predicate. But this resolving is not
according to SQL standard: A table name cannot be part of a column alias,
it has to be a simple identifier, and any referencing column must also
be a simple identifier. By changing item_ref->item_name to
item_ref->field_name in a test in find_item_in_list, we ensure that the
match is against a table (view) name and column name and not an alias.

But there is also another problem. The EXISTS predicate contains a few
selected columns that are resolved and then immediately deleted since
they are redundant in EXISTS processing. But if these columns are
outer references and defined in a derived table, we may actually
de-reference them before the initial reference increment. Thus, those
columns are removed before they are possibly used later. This happens
to subq_1.c2 which is resolved in the outer-most query block and
coming from a derived table. We prevent this problem by incrementing
the reference count of selected expressions from derived tables earlier,
and we try to prevent this problem from re-occuring by adding an
"m_abandoned" field in class Item, which is set to true when the
reference count is decremented to zero and prevents the reference count
from ever be incremented after that.

Change-Id: Idda48ae726a580c1abdc000371b49a753e197bc6
percona-ysorokin pushed a commit that referenced this pull request Oct 26, 2023
https://jira.percona.com/browse/PS-8592

Description
-----------
GR suffered from problems caused by security probes and network scanner
processes connecting to the group replication communication port. This usually
is not a problem, but poses a serious threat when another member tries to join
the cluster by initiating a connection to the member which is affected by
external processes using the port dedicated for group communication for longer
durations.

On such activities by external processes, the SSL-enabled server stalled forever
on the SSL_accept() call waiting for handshake data. Below is the stacktrace:

    Thread 55 (Thread 0x7f7bb77ff700 (LWP 2198598)):
    #0 in read ()
    #1 in sock_read ()
    #2 in BIO_read ()
    #3 in ssl23_read_bytes ()
    #4 in ssl23_get_client_hello ()
    #5 in ssl23_accept ()
    #6 in xcom_tcp_server_startup(Xcom_network_provider*) ()

When the server stalled forever in the above path, it prevented other members
from joining the cluster, resulting in the following messages in the joiner
server's logs.

    [ERROR] [MY-011640] [Repl] Plugin group_replication reported: 'Timeout on wait for view after joining group'
    [ERROR] [MY-011735] [Repl] Plugin group_replication reported: '[GCS] The member is already leaving or joining a group.'

Solution
--------
This patch adds two new variables

1. group_replication_xcom_ssl_socket_timeout

   It is a file-descriptor level timeout in seconds for both accept() and
   SSL_accept() calls when group replication is listening on the xcom port.
   When set to a valid value, say for example 5 seconds, both accept() and
   SSL_accept() return after 5 seconds. The default value has been set to 0
   (waits infinitely) for backward compatibility. This variable is effective
   only when GR is configred with SSL.

2. group_replication_xcom_ssl_accept_retries

   It defines the number of retries to be performed before closing the socket.
   For each retry the server thread calls SSL_accept() with the timeout defined
   by group_replication_xcom_ssl_socket_timeout for the SSL handshake process,
   once the connection has been accepted by the first accept() call. The
   default value has been set to 10. This variable is effective only when GR is
   configured with SSL. A rough sketch of the mechanism follows.
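
   A rough sketch of the described behaviour using standard socket and
   OpenSSL calls; this is an assumption-laden illustration, not the
   actual XCom code:

```
#include <cerrno>
#include <sys/socket.h>
#include <sys/time.h>
#include <openssl/ssl.h>

// SO_RCVTIMEO bounds each handshake read so SSL_accept() can no longer
// block forever, and the handshake is retried a limited number of
// times before the connection is dropped.
bool AcceptWithRetries(int fd, SSL *ssl, int timeout_sec, int retries) {
  timeval tv{timeout_sec, 0};
  setsockopt(fd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv));

  for (int attempt = 0; attempt < retries; ++attempt) {
    const int ret = SSL_accept(ssl);
    if (ret == 1) return true;  // handshake completed
    const int err = SSL_get_error(ssl, ret);
    const bool timed_out = err == SSL_ERROR_SYSCALL &&
                           (errno == EAGAIN || errno == EWOULDBLOCK);
    if (!timed_out && err != SSL_ERROR_WANT_READ &&
        err != SSL_ERROR_WANT_WRITE)
      return false;  // hard failure: give up instead of stalling forever
  }
  return false;  // retries exhausted: caller closes the socket
}
```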

Note:
- Both of the above variables are dynamically configurable, but will become
  effective only on START GROUP_REPLICATION.
percona-ysorokin pushed a commit that referenced this pull request Jan 23, 2024
…ocal DDL executed

https://perconadev.atlassian.net/browse/PS-9018

Problem
-------
In high concurrency scenarios, MySQL replica can enter into a deadlock due to a
race condition between the replica applier thread and the client thread
performing a binlog group commit.

Analysis
--------
It needs at least 3 threads for this deadlock to happen

1. One client thread
2. Two replica applier threads

How this deadlock happens?
--------------------------
0. Binlog is enabled on replica, but log_replica_updates is disabled.

1. Initially, both "Commit Order" and "Binlog Flush" queues are empty.

2. Replica applier thread 1 enters the group commit pipeline to register in the
   "Commit Order" queue since `log-replica-updates` is disabled on the replica
   node.

3. Since both "Commit Order" and "Binlog Flush" queues are empty, the applier
   thread 1

   3.1. Becomes leader (In Commit_stage_manager::enroll_for()).

   3.2. Registers in the commit order queue.

   3.3. Acquires the lock MYSQL_BIN_LOG::LOCK_log.

   3.4. Commit Order queue is emptied, but the lock MYSQL_BIN_LOG::LOCK_log is
        not yet released.

   NOTE: SE commit for applier thread is already done by the time it reaches
         here.

4. Replica applier thread 2 enters the group commit pipeline to register in the
   "Commit Order" queue since `log-replica-updates` is disabled on the replica
   node.

5. Since the "Commit Order" queue is empty (emptied by applier thread 1 in 3.4), the
   applier thread 2

   5.1. Becomes leader (In Commit_stage_manager::enroll_for())

   5.2. Registers in the commit order queue.

   5.3. Tries to acquire the lock MYSQL_BIN_LOG::LOCK_log. Since it is held by applier
        thread 1 it will wait until the lock is released.

6. Client thread enters the group commit pipeline to register in the
   "Binlog Flush" queue.

7. Since "Commit Order" queue is not empty (there is applier thread 2 in the
   queue), it enters the conditional wait `m_stage_cond_leader` with an
   intention to become the leader for both the "Binlog Flush" and
   "Commit Order" queues.

8. Applier thread 1 releases the lock MYSQL_BIN_LOG::LOCK_log and proceeds to update
   the GTID by calling gtid_state->update_commit_group() from
   Commit_order_manager::flush_engine_and_signal_threads().

9. Applier thread 2 acquires the lock MYSQL_BIN_LOG::LOCK_log.

   9.1. It checks if there is any thread waiting in the "Binlog Flush" queue
        to become the leader. Here it finds the client thread waiting to be
        the leader.

   9.2. It releases the lock MYSQL_BIN_LOG::LOCK_log and signals on the
        cond_var `m_stage_cond_leader` and enters a conditional wait until the
        thread's `tx_commit_pending` is set to false by the client thread
       (will be done in the
       Commit_stage_manager::process_final_stage_for_ordered_commit_group()
       called by client thread from fetch_and_process_flush_stage_queue()).

10. The client thread wakes up from the cond_var `m_stage_cond_leader`.  The
    thread has now become a leader and it is its responsibility to update GTID
    of applier thread 2.

    10.1. It acquires the lock MYSQL_BIN_LOG::LOCK_log.

    10.2. Returns from `enroll_for()` and proceeds to process the
          "Commit Order" and "Binlog Flush" queues.

    10.3. Fetches the "Commit Order" and "Binlog Flush" queues.

    10.4. Performs the storage engine flush by calling ha_flush_logs() from
          fetch_and_process_flush_stage_queue().

    10.5. Proceeds to update the GTID of threads in "Commit Order" queue by
          calling gtid_state->update_commit_group() from
          Commit_stage_manager::process_final_stage_for_ordered_commit_group().

11. At this point, we will have

    - Client thread performing GTID update on behalf of applier thread 2 (from step 10.5), and
    - Applier thread 1 performing GTID update for itself (from step 8).

    Due to the lack of proper synchronization between the above two threads,
    there exists a time window where both threads can call
    gtid_state->update_commit_group() concurrently.

    In subsequent steps, both threads simultaneously try to modify the contents
    of the array `commit_group_sidnos` which is used to track the lock status of
    sidnos. This concurrent access to `update_commit_group()` can cause a
    lock-leak resulting in one thread acquiring the sidno lock and not
    releasing at all.

-----------------------------------------------------------------------------------------------------------
Client thread                                           Applier Thread 1
-----------------------------------------------------------------------------------------------------------
update_commit_group() => global_sid_lock->rdlock();     update_commit_group() => global_sid_lock->rdlock();

calls update_gtids_impl_lock_sidnos()                   calls update_gtids_impl_lock_sidnos()

set commit_group_sidno[2] = true                        set commit_group_sidno[2] = true

                                                        lock_sidno(2) -> successful

lock_sidno(2) -> waits

                                                        update_gtids_impl_own_gtid() -> Add the thd->owned_gtid in `executed_gtids()`

                                                        if (commit_group_sidnos[2]) {
                                                          unlock_sidno(2);
                                                          commit_group_sidnos[2] = false;
                                                        }

                                                        Applier thread continues..

lock_sidno(2) -> successful

update_gtids_impl_own_gtid() -> Add the thd->owned_gtid in `executed_gtids()`

if (commit_group_sidnos[2]) { <=== this check fails and lock is not released.
  unlock_sidno(2);
  commit_group_sidnos[2] = false;
}

Client thread continues without releasing the lock
-----------------------------------------------------------------------------------------------------------

12. As the above lock-leak can also happen the other way, i.e., the applier
    thread fails to unlock, there can be different consequences hereafter.

13. If the client thread continues without releasing the lock, then at a later
    stage it can enter into a deadlock with the applier thread performing a
    GTID update, with the stack traces shown below.

    Client_thread
    -------------
    #1  __GI___lll_lock_wait
    #2  ___pthread_mutex_lock
    #3  native_mutex_lock                                       <= waits for commit lock while holding sidno lock
    #4  Commit_stage_manager::enroll_for
    #5  MYSQL_BIN_LOG::change_stage
    #6  MYSQL_BIN_LOG::ordered_commit
    #7  MYSQL_BIN_LOG::commit
    percona#8  ha_commit_trans
    percona#9  trans_commit_implicit
    percona#10 mysql_create_like_table
    percona#11 Sql_cmd_create_table::execute
    percona#12 mysql_execute_command
    percona#13 dispatch_sql_command

    Applier thread
    --------------
    #1  ___pthread_mutex_lock
    #2  native_mutex_lock
    #3  safe_mutex_lock
    #4  Gtid_state::update_gtids_impl_lock_sidnos               <= waits for sidno lock
    #5  Gtid_state::update_commit_group
    #6  Commit_order_manager::flush_engine_and_signal_threads   <= acquires commit lock here
    #7  Commit_order_manager::finish
    percona#8  Commit_order_manager::wait_and_finish
    percona#9  ha_commit_low
    percona#10 trx_coordinator::commit_in_engines
    percona#11 MYSQL_BIN_LOG::commit
    percona#12 ha_commit_trans
    percona#13 trans_commit
    percona#14 Xid_log_event::do_commit
    percona#15 Xid_apply_log_event::do_apply_event_worker
    percona#16 Slave_worker::slave_worker_exec_event
    percona#17 slave_worker_exec_job_group
    percona#18 handle_slave_worker

14. If the applier thread continues without releasing the lock, then at a later
    stage, it can perform recursive locking while setting the GTID for the next
    transaction (in set_gtid_next()).

    In debug builds the above case hits the assertion
    `safe_mutex_assert_not_owner()` meaning the lock is already acquired by the
    replica applier thread when it tries to re-acquire the lock.

Solution
--------
In the above problematic example, when seen from each thread
individually, we can conclude that there is no problem in the order of lock
acquisition, thus there is no need to change the lock order.

However, the root cause for this problem is that multiple threads can
concurrently access the array `Gtid_state::commit_group_sidnos`.

In its initial implementation, it was expected that threads should
hold the `MYSQL_BIN_LOG::LOCK_commit` before modifying its contents. But it
was not considered when upstream implemented WL#7846 (MTS:
slave-preserve-commit-order when log-slave-updates/binlog is disabled).

With this patch, we now ensure that `MYSQL_BIN_LOG::LOCK_commit` is acquired
by the client thread (binlog flush leader) when it tries to perform the GTID
update on behalf of threads waiting in the "Commit Order" queue, thus providing
a guarantee that the `Gtid_state::commit_group_sidnos` array is never accessed
without the protection of `MYSQL_BIN_LOG::LOCK_commit`. A minimal sketch of
the added protection follows.
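
A minimal sketch, with stand-in names for the real lock and array:

```
#include <mutex>

// Hypothetical stand-ins: LOCK_commit for MYSQL_BIN_LOG::LOCK_commit,
// commit_group_sidnos for the array inside Gtid_state.
std::mutex LOCK_commit;
bool commit_group_sidnos[64] = {};

// The binlog flush leader takes the commit lock before updating GTIDs
// on behalf of the threads waiting in the "Commit Order" queue, so
// commit_group_sidnos is never touched by two threads at once.
void UpdateCommitGroupAsLeader(int sidno) {
  std::lock_guard<std::mutex> guard(LOCK_commit);  // the added protection
  commit_group_sidnos[sidno] = true;
  // ... lock_sidno(), update owned GTIDs, unlock_sidno() ...
  commit_group_sidnos[sidno] = false;
}
```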
percona-ysorokin pushed a commit that referenced this pull request Feb 16, 2024
Part of WL#15135 Certificate Architecture

This patch adds an instance of TlsKeyManager to class
TransporterRegistry. This TlsKeyManager will handle certificate
authentication in all node types.

A new method TransporterRegistry::init_tls() configures TLS at
node startup time.

Change-Id: I1f9d3fff21ea7f2d9f009cce48823304c2baead7
percona-ysorokin pushed a commit that referenced this pull request Feb 16, 2024
Add a unit test, an NdbApi test, and an MTR test.

The unit test is testNdbProcess-t
The NdbApi test is testMgmd -n SshKeySigning
The MTR test is sign_keys in suite ndb_tls

Create the ndb_tls test suite.
Create the ndb-tls subdirectory in std_data.
Create a CA key and certificate in std_data/ndb-tls/.

Change-Id: Icec0fa4a9031be11facbd346d09debe8bc8bfe68
percona-ysorokin pushed a commit that referenced this pull request Feb 16, 2024
Negotiate TLS in SocketAuthTls SocketAuthenticator.

On the server side, TransporterRegistry always instantiates
a SocketAuthTls authenticator.

Change-Id: I826390b545ef96ec4224ff25bc66d9fdb7a5cf7a
percona-ysorokin pushed a commit that referenced this pull request Feb 16, 2024
Add the final bit of code into TransporterRegsitry to start TLS
before transporter upgrade, and update the MTR test results.

The tls_required and tls_off_certs tests will show TLS in use
for transporter connections to MGMD.

Change-Id: I2683447c02b27e498873fee77e0382c609a477cd
percona-ysorokin pushed a commit that referenced this pull request Mar 4, 2024
…ocal DDL executed

https://perconadev.atlassian.net/browse/PS-9018
percona-ysorokin pushed a commit that referenced this pull request Mar 4, 2024
…ocal DDL executed

https://perconadev.atlassian.net/browse/PS-9018

Merge remote-tracking branch 'venki/PS-9018-8.0-gca' into HEAD

Problem
-------
In high concurrency scenarios, MySQL replica can enter into a deadlock due to a
race condition between the replica applier thread and the client thread
performing a binlog group commit.

Analysis
--------
It needs at least 3 threads for this deadlock to happen

1. One client thread
2. Two replica applier threads

How this deadlock happens?
--------------------------
0. Binlog is enabled on replica, but log_replica_updates is disabled.

1. Initially, both "Commit Order" and "Binlog Flush" queues are empty.

2. Replica applier thread 1 enters the group commit pipeline to register in the
   "Commit Order" queue since `log-replica-updates` is disabled on the replica
   node.

3. Since both "Commit Order" and "Binlog Flush" queues are empty, the applier
   thread 1

   3.1. Becomes leader (In Commit_stage_manager::enroll_for()).

   3.2. Registers in the commit order queue.

   3.3. Acquires the lock MYSQL_BIN_LOG::LOCK_log.

   3.4. Commit Order queue is emptied, but the lock MYSQL_BIN_LOG::LOCK_log is
        not yet released.

   NOTE: SE commit for applier thread is already done by the time it reaches
         here.

4. Replica applier thread 2 enters the group commit pipeline to register in the
   "Commit Order" queue since `log-replica-updates` is disabled on the replica
   node.

5. Since the "Commit Order" queue is empty (emptied by applier thread 1 in 3.4), the
   applier thread 2

   5.1. Becomes leader (In Commit_stage_manager::enroll_for())

   5.2. Registers in the commit order queue.

   5.3. Tries to acquire the lock MYSQL_BIN_LOG::LOCK_log. Since it is held by applier
        thread 1 it will wait until the lock is released.

6. Client thread enters the group commit pipeline to register in the
   "Binlog Flush" queue.

7. Since "Commit Order" queue is not empty (there is applier thread 2 in the
   queue), it enters the conditional wait `m_stage_cond_leader` with an
   intention to become the leader for both the "Binlog Flush" and
   "Commit Order" queues.

8. Applier thread 1 releases the lock MYSQL_BIN_LOG::LOCK_log and proceeds to update
   the GTID by calling gtid_state->update_commit_group() from
   Commit_order_manager::flush_engine_and_signal_threads().

9. Applier thread 2 acquires the lock MYSQL_BIN_LOG::LOCK_log.

   9.1. It checks if there is any thread waiting in the "Binlog Flush" queue
        to become the leader. Here it finds the client thread waiting to be
        the leader.

   9.2. It releases the lock MYSQL_BIN_LOG::LOCK_log and signals on the
        cond_var `m_stage_cond_leader` and enters a conditional wait until the
        thread's `tx_commit_pending` is set to false by the client thread
       (will be done in the
       Commit_stage_manager::process_final_stage_for_ordered_commit_group()
       called by client thread from fetch_and_process_flush_stage_queue()).

10. The client thread wakes up from the cond_var `m_stage_cond_leader`.  The
    thread has now become a leader and it is its responsibility to update GTID
    of applier thread 2.

    10.1. It acquires the lock MYSQL_BIN_LOG::LOCK_log.

    10.2. Returns from `enroll_for()` and proceeds to process the
          "Commit Order" and "Binlog Flush" queues.

    10.3. Fetches the "Commit Order" and "Binlog Flush" queues.

    10.4. Performs the storage engine flush by calling ha_flush_logs() from
          fetch_and_process_flush_stage_queue().

    10.5. Proceeds to update the GTID of threads in "Commit Order" queue by
          calling gtid_state->update_commit_group() from
          Commit_stage_manager::process_final_stage_for_ordered_commit_group().

11. At this point, we have:

    - The client thread performing the GTID update on behalf of applier thread 2 (from step 10.5), and
    - Applier thread 1 performing the GTID update for itself (from step 8).

    Due to the lack of proper synchronization between the above two threads,
    there is a time window in which both threads can call
    gtid_state->update_commit_group() concurrently.

    In subsequent steps, both threads simultaneously try to modify the contents
    of the array `commit_group_sidnos`, which is used to track the lock status
    of sidnos. This concurrent access in `update_commit_group()` can cause a
    lock leak in which one thread acquires the sidno lock and never releases it
    (see the interleaving below and the standalone sketch after it).

-----------------------------------------------------------------------------------------------------------
Client thread                                           Applier Thread 1
-----------------------------------------------------------------------------------------------------------
update_commit_group() => global_sid_lock->rdlock();     update_commit_group() => global_sid_lock->rdlock();

calls update_gtids_impl_lock_sidnos()                   calls update_gtids_impl_lock_sidnos()

set commit_group_sidno[2] = true                        set commit_group_sidno[2] = true

                                                        lock_sidno(2) -> successful

lock_sidno(2) -> waits

                                                        update_gtids_impl_own_gtid() -> Add the thd->owned_gtid in `executed_gtids()`

                                                        if (commit_group_sidnos[2]) {
                                                          unlock_sidno(2);
                                                          commit_group_sidnos[2] = false;
                                                        }

                                                        Applier thread continues..

lock_sidno(2) -> successful

update_gtids_impl_own_gtid() -> Add the thd->owned_gtid in `executed_gtids()`

if (commit_group_sidnos[2]) { <=== this check fails and lock is not released.
  unlock_sidno(2);
  commit_group_sidnos[2] = false;
}

Client thread continues without releasing the lock
-----------------------------------------------------------------------------------------------------------
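
The following is a minimal standalone C++ sketch of this race. It is a
hypothetical simplification, not the server code: the real flags live in
`Gtid_state::commit_group_sidnos`, the real per-sidno locks are lock objects
inside `Gtid_state`, and all names below are invented for illustration.

    #include <mutex>
    #include <thread>

    static std::mutex sidno_locks[4];         // stand-in for per-sidno locks
    static bool commit_group_sidnos[4] = {};  // unsynchronized flags: the bug

    static void update_commit_group_sketch(int sidno) {
      commit_group_sidnos[sidno] = true;  // both threads set the flag
      sidno_locks[sidno].lock();          // the second thread blocks here
      // ... add this thread's owned GTID to executed_gtids ...
      if (commit_group_sidnos[sidno]) {   // first finisher sees true,
        sidno_locks[sidno].unlock();      // unlocks, and clears the flag
        commit_group_sidnos[sidno] = false;
      }
      // else: the thread returns while still holding the lock (the leak)
    }

    int main() {
      std::thread applier(update_commit_group_sketch, 2);
      std::thread client(update_commit_group_sketch, 2);
      applier.join();
      client.join();
      // Depending on the interleaving, sidno_locks[2] is now leaked.
    }

Whichever thread finishes its GTID update second finds the flag already
cleared, skips the unlock, and keeps the sidno lock forever.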

12. Since the lock leak can also happen the other way around, i.e. the applier
    thread fails to unlock, the consequences from here on can differ.

13. If the client thread continues without releasing the lock, then at a later
    stage it can deadlock with an applier thread performing a GTID update, as
    shown by the stack traces and the two-mutex sketch below.

    Client_thread
    -------------
    #1  __GI___lll_lock_wait
    #2  ___pthread_mutex_lock
    #3  native_mutex_lock                                       <= waits for commit lock while holding sidno lock
    #4  Commit_stage_manager::enroll_for
    #5  MYSQL_BIN_LOG::change_stage
    #6  MYSQL_BIN_LOG::ordered_commit
    #7  MYSQL_BIN_LOG::commit
    #8  ha_commit_trans
    #9  trans_commit_implicit
    #10 mysql_create_like_table
    #11 Sql_cmd_create_table::execute
    #12 mysql_execute_command
    #13 dispatch_sql_command

    Applier thread
    --------------
    #1  ___pthread_mutex_lock
    #2  native_mutex_lock
    #3  safe_mutex_lock
    #4  Gtid_state::update_gtids_impl_lock_sidnos               <= waits for sidno lock
    #5  Gtid_state::update_commit_group
    #6  Commit_order_manager::flush_engine_and_signal_threads   <= acquires commit lock here
    #7  Commit_order_manager::finish
    #8  Commit_order_manager::wait_and_finish
    #9  ha_commit_low
    #10 trx_coordinator::commit_in_engines
    #11 MYSQL_BIN_LOG::commit
    #12 ha_commit_trans
    #13 trans_commit
    #14 Xid_log_event::do_commit
    #15 Xid_apply_log_event::do_apply_event_worker
    #16 Slave_worker::slave_worker_exec_event
    #17 slave_worker_exec_job_group
    #18 handle_slave_worker
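
This is a classic ABBA lock-order inversion. A minimal standalone sketch
(invented names; std::mutex stands in for MYSQL_BIN_LOG::LOCK_commit and the
leaked sidno lock; the sleeps only make the bad interleaving deterministic):

    #include <chrono>
    #include <mutex>
    #include <thread>

    static std::mutex commit_lock;  // stand-in for MYSQL_BIN_LOG::LOCK_commit
    static std::mutex sidno_lock;   // stand-in for the leaked sidno lock

    int main() {
      // Client thread: still holds the leaked sidno lock, then waits for
      // the commit lock inside Commit_stage_manager::enroll_for().
      std::thread client([] {
        std::lock_guard<std::mutex> sidno(sidno_lock);
        std::this_thread::sleep_for(std::chrono::milliseconds(100));
        std::lock_guard<std::mutex> commit(commit_lock);  // blocks forever
      });

      // Applier thread: holds the commit lock in
      // flush_engine_and_signal_threads(), then waits for the sidno lock
      // in Gtid_state::update_gtids_impl_lock_sidnos().
      std::thread applier([] {
        std::lock_guard<std::mutex> commit(commit_lock);
        std::this_thread::sleep_for(std::chrono::milliseconds(100));
        std::lock_guard<std::mutex> sidno(sidno_lock);    // blocks forever
      });

      client.join();  // this program deadlocks by design
      applier.join();
    }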

14. If the applier thread continues without releasing the lock, then at a later
    stage it can lock recursively while setting the GTID for the next
    transaction (in set_gtid_next()).

    In debug builds, this case hits the assertion
    `safe_mutex_assert_not_owner()`, meaning the replica applier thread already
    holds the lock when it tries to re-acquire it, as the sketch below
    illustrates.
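
Here is a tiny standalone sketch, assuming only standard POSIX threading APIs,
of how a leaked non-recursive lock surfaces on the next acquisition. An
error-checking mutex reports EDEADLK where the server's debug-build safe_mutex
fires `safe_mutex_assert_not_owner()`; the real sidno locks are MySQL's own
lock objects, so this is an analogy, not the actual code.

    #include <cassert>
    #include <cerrno>
    #include <pthread.h>

    int main() {
      pthread_mutexattr_t attr;
      pthread_mutexattr_init(&attr);
      pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);

      pthread_mutex_t sidno_lock;  // stand-in for a leaked sidno lock
      pthread_mutex_init(&sidno_lock, &attr);

      pthread_mutex_lock(&sidno_lock);           // the leaked acquisition
      int rc = pthread_mutex_lock(&sidno_lock);  // re-acquired for the next
                                                 // transaction's GTID update
      assert(rc == EDEADLK);  // recursive locking detected, as in step 14
      pthread_mutex_unlock(&sidno_lock);
      return 0;
    }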

Solution
--------
In the above problematic example, when seen from each thread individually,
there is no problem in the order of lock acquisition, so there is no need to
change the lock order.

However, the root cause of this problem is that multiple threads can
concurrently access the array `Gtid_state::commit_group_sidnos`.

In its initial implementation, threads were expected to hold
`MYSQL_BIN_LOG::LOCK_commit` before modifying its contents, but this
requirement was overlooked when upstream implemented WL#7846 (MTS:
slave-preserve-commit-order when log-slave-updates/binlog is disabled).

With this patch, we now ensure that the client thread (binlog flush leader)
acquires `MYSQL_BIN_LOG::LOCK_commit` when it performs the GTID update on
behalf of threads waiting in the "Commit Order" queue, thus guaranteeing that
the `Gtid_state::commit_group_sidnos` array is never accessed without the
protection of `MYSQL_BIN_LOG::LOCK_commit`. A minimal sketch of the resulting
locking discipline follows.
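
The sketch below reuses the invented names from the earlier race sketch; the
real lock is `MYSQL_BIN_LOG::LOCK_commit` and the real call site is the leader
path in Commit_stage_manager::process_final_stage_for_ordered_commit_group().

    #include <mutex>

    static std::mutex LOCK_commit_sketch;  // stand-in for LOCK_commit

    // Client thread (binlog flush leader) updating GTIDs on behalf of the
    // "Commit Order" queue. The applier's own update path holds the same
    // lock, so commit_group_sidnos is never touched concurrently.
    static void leader_gtid_update_sketch() {
      std::lock_guard<std::mutex> guard(LOCK_commit_sketch);
      // update_commit_group_sketch(sidno);  // from the earlier race sketch
    }

    int main() { leader_gtid_update_sketch(); }

With both writers serialized behind the same lock, the second finisher always
observes the flag it set, so the unlock in the race sketch is never skipped.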