Ps 7626 8.0 #3

Closed
wants to merge 178 commits into from
Conversation

ldonoso
Owner

@ldonoso ldonoso commented Apr 7, 2021

No description provided.

patrickbirch and others added 30 commits June 1, 2020 11:31
modified threadpool - removed link, added link to System Variables
Approved By: Erlend Dahl <erlend.dahl@oracle.com>
….49-part2

PS-7012: Merge MySQL 5.6.49 part 2
Modified source/conf.py source/release-notes/release-notes_index
Created source/release-notes/Percona-Server-5.6.49-89.0
PS-7219: Correct PRODUCT_FULL variable
modified file: yum_repo
added a downgrade section
mysql/mysql-server@8aeda36

Bug#30909369: MTR DOES NOT NOTICE A SERVER ABORT AFTER "BYE"

Issue:
======
After a test is run, when the server is shut down, MTR
does not detect the exit status of the shutdown, and
reports that the test passed, even if the server crashes.

Fix:
====
Added a shutdown_report for every run of tests, which
will report whether the shutdown of the servers was
successful. The exit status of the shutdown is monitored
and, if a failure is found, the error log is parsed and
the shutdown_report will contain the errors.

This also fixes Bug#29818311, which reported the same
problem.

Change-Id: I4d8043915f9b085bfbd25e811c957cd0b4809454
(cherry picked from commit 8aeda36)
Summary:
WebScaleSQL Feature: JUnit Support for MTR

Produces JUnit test reports that can be consumed by tools such as Jenkins.

Two new options are added to MTR:

  --junit-output=FILE   Output JUnit test summary XML to FILE.
  --junit-package=NAME  Set the JUnit package name to NAME for this test run.

A test run using junit reporting would look like:

  ./mtr --comment=rpl_row --junit-package=rpl_row --junit-output=rpl_row.xml \
        --suite=rpl --mysqld=--binlog-format=row

Typically, the package name should be the name for the test run, in
this case the suite name and its variation. Test results are written
to the rpl_row.xml file (including test timing and output). When used
with Jenkins, the XML file can be used for publishing test results,
and Jenkins will show how many tests failed, test durations, and
whether to mark the build as failed or unstable (if there were failing
tests that succeeded on retry).

Test Plan: mtr

Reviewers: steaphan

Originally Reviewed By: steaphan
Adds a number of tests that fail in various ways, to test how their
failures are propagated to the JUnit report:

01_pass:             always passes
02_disabled:         passes, but is disabled by disabled.def
03_skipped:          always skips itself
04_result_mismatch:  result file contents differs
05_failed_assertion: asserts always-false condition
06_bootstrap_crash:  crashes in the bootstrap phase
07_immediate_crash:  crashes on command
08_shutdown_crash:   crashes in the shutdown phase
09_boostrap_hang:    waits indefinitely in the bootstrap phase
10_immediate_hang:   waits indefinitely on command
11_shutdown_hang:    waits indefinitely in the shutdown phase
12_shutdown_hang_in_master:  master process of a master/slave pair waits
                             indefinitely in the shutdown phase
13_shutdown_hang_in_slave:   slave process of a master/slave pair waits
                             indefinitely in the shutdown phase
14_shutdown_hang_in_both:    both master and slave wait indefinitely in the
                             shutdown phase
15_fail_every_second_time:   fails every second time
16_bootstrap_buffer_overrun: writes behind a buffer during bootstrap
17_immediate_buffer_overrun: writes behind a buffer on command
18_shutdown_buffer_overrun:  writes behind a buffer during shutdown
19_leak_memory:              allocates but never frees a block of memory
20_leak_memory_in_master:    master process leaks memory
21_leak_memory_in_slave:     slave process leaks memory
22_leak_memory_in_both:      both master and slave processes leak memory
Failing unit tests are included in the JUnit-style report. To evaluate
the reporting capability, we need assorted predictable failures.

This patch adds a test suite containing test cases that:
- succeed;
- fail on EXPECT_EQ();
- crash by calling abort();
- wait indefinitely;
- write behind a memory buffer;
- leak memory.

The test suite is disabled by default. Enable it with -DWITH_FAILING_GUNIT_TESTS.
There is now a time limit for the unit tests when they are started by
mysql-test-run.pl. As they are treated as another test case, the same
"testcase-timeout" parameter is used.

Also, CTest output is included in the JUnit report.
PS-7213 Add jenkins file for autobuilds
Collect all the output related to shutdown and valgrind reports
and add it as a comment to shutdown_report and valgrind_report
tests in case failures are found. This will also add corresponding
output to JUnit reports.
If Valgrind was enabled for mysqld, CTest is now also asked to run its
tests using Valgrind's Memcheck tool.
Passing unit tests may still have Valgrind warnings in them. Those
messages are lost unless CTest output is always included in JUnit reports.
The same test may be run by the framework several times with different
environment settings (replication mode, for example), and if this
environment doesn't match the test requirements, the test is reported
as skipped. Don't show such tests in the results XML file, in order
not to overload the results with falsely skipped tests.
5.6 - Fixed PS-7346 - Huge memory usage for audit plugin when large queries…
	modified:   source/faq.rst
	modified:   source/flexibility/binlogging_replication_improvements.rst
	modified:   source/flexibility/csv_engine_mode.rst
	modified:   source/flexibility/extended_mysqlbinlog.rst
…on_triggers" (5.7)

modified:   source/myrocks/status_variables.rst
PS-7374: Travis CI and Azure: add clang-11 testing
https://jira.percona.com/browse/PS-7200
***
Updated man pages from MySQL Server 5.6.50 source tarball.
***
Updated 'scripts/fill_help_tables.sql' from MySQL Server 5.6.50 source tarball.
…olved)

https://jira.percona.com/browse/PS-7200
***
VERSION raised to "5.6.50-90.0".
PERCONA_INNODB_VERSION in univ.i raised to "90.0".
PS-7379: fix compilation issues with clang-11 (5.6)
….50-part1

PS-7200: Merge MySQL 5.6.50 part 1 (up to e453966)
ldonoso pushed a commit that referenced this pull request Apr 7, 2021
TABLESPACE STATE DOES NOT CHANGE THE SPACE TO EMPTY

After the commit for Bug#31991688, it was found that an idle system may
never get around to truncating an undo tablespace when it is SET INACTIVE.
In fact, it takes about 128 seconds before the undo tablespace is finally
truncated.

There are three main tasks for the function trx_purge().
1) Process the undo logs and apply changes to the data files.
   (May be multiple threads)
2) Clean up the history list by freeing old undo logs and rollback
   segments.
3) Truncate undo tablespaces that have grown too big or are SET INACTIVE
   explicitly.

Bug#31991688 made sure that steps 2 & 3 are not done too often.
Concentrating this effort keeps the purge lag from growing too large.
By default, trx_purge() does step #1 128 times before attempting steps
#2 & #3, which are called the 'truncate' steps.  This is controlled by
the setting innodb_purge_rseg_truncate_frequency.

On an idle system, trx_purge() is called once per second if it has nothing
to do in step 1.  After 128 seconds, it will finally do step 2 (truncating
the undo logs and rollback segments, which reduces the history list to zero)
and step 3 (truncating any undo tablespaces that need it).

The function that the purge coordinator thread uses to make these repeated
calls to trx_purge() is called srv_do_purge(). When trx_purge() returns
having done nothing, srv_do_purge() returns to srv_purge_coordinator_thread(),
which will put the purge thread to sleep.  It is woken up again once per
second by the master thread in srv_master_do_idle_tasks(), if not sooner
by any of several other threads and activities.

This is how an idle system can wait 128 seconds before the truncate steps
are done and an undo tablespace that was SET INACTIVE can finally become
'empty'.

The solution in this patch is to modify srv_do_purge() so that if trx_purge()
did nothing and there is an undo space that was explicitly set to inactive,
it will immediately call trx_purge() again with do_truncate=true so that steps
#2 and #3 are done.
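
A minimal sketch of the modified srv_do_purge(), assuming the function names used in this commit; the helper undo::is_inactive_explicit_pending() is hypothetical:

```
// Hedged sketch: if the purge did nothing and an undo tablespace is
// explicitly SET INACTIVE, run the truncate steps (#2 and #3) right away
// instead of waiting for the 128-call truncate frequency to elapse.
static ulint srv_do_purge(ulint n_threads, ulint batch_size,
                          ulint *n_total_purged) {
  ulint n_pages = trx_purge(n_threads, batch_size, /*truncate=*/false);

  if (n_pages == 0 && undo::is_inactive_explicit_pending()) {
    // Idle system with a pending explicit truncate: do it immediately.
    trx_purge(n_threads, batch_size, /*truncate=*/true);
  }

  *n_total_purged += n_pages;
  return n_pages;
}
```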

This does not affect the effort by Bug#31991688 to keep the purge lag from
growing too big on sysbench UPDATE NO_KEY. With this change, the purge lag
has to be zero and there must be a pending explicit undo space truncate
before this extra call to trx_purge is done.

Approved by Sunny in RB#25311
ldonoso pushed a commit that referenced this pull request Apr 7, 2021
…TH VS 2019 [#3] [noclose]

storage\ndb\src\common\portlib\NdbThread.cpp(1240,3): warning C4805: '==': unsafe mix of type 'int' and type 'bool' in operation

Change-Id: I33e3ff9845f3d3e496f64401d30eaa9b992da594
@ldonoso ldonoso closed this Apr 7, 2021
ldonoso pushed a commit that referenced this pull request Jul 6, 2021
Upstream commit ID : fb-mysql-5.6.35/25b42b5d4365b6d2deb6bf40da9e776cd2e56698
PS-7780 : Merge fb-prod202101

Summary:
Today in `SELECT count(*)` MyRocks would still decode every single column due to this check, despite the readset being empty:

```
 // bitmap is cleared on index merge, but it still needs to decode columns
    bool field_requested =
        decode_all_fields || m_verify_row_debug_checksums ||
        bitmap_is_set(field_map, m_table->field[i]->field_index);
```

As a result MyRocks is significantly slower than InnoDB in this particular scenario.

Turns out in index merge, when it tries to reset, it calls ha_index_init with an empty column_bitmap, so our field decoders didn't know they needed to decode anything, so the entire query would return nothing. This is discussed in [this commit](facebook/mysql-5.6@70f2bcd), and [issue 624](facebook/mysql-5.6#624) and [PR 626](facebook/mysql-5.6#626). So the workaround we had at that time was to simply treat an empty map as implicitly meaning everything, and the side effect is a massively slowed-down count(*).

We have a few options to address this:
1. Fix index merge optimizer - looking at the code in QUICK_RANGE_SELECT::init_ror_merged_scan, it actually fixes up the column_bitmap properly, but after init/reset, so the fix would simply be moving the bitmap set code up. For secondary keys, prepare_for_position will automatically call `mark_columns_used_by_index_no_reset(s->primary_key, read_set)` if HA_PRIMARY_KEY_REQUIRED_FOR_POSITION is set (true for both InnoDB and MyRocks), so we would know correctly that we need to unpack PK when walking SK during index merge.
2. Overriding `column_bitmaps_signal` and setting up decoders whenever the bitmap changes - however, this doesn't work by itself. Because no storage engine today actually uses handler::column_bitmaps_signal, this path hasn't been tested properly in index merge. In this case, QUICK_RANGE_SELECT::init_ror_merged_scan should call set_column_bitmaps_no_signal to avoid resetting the correct read/write set of head, since head is used as the first handler (reuses_handler=true) and as subsequent placeholders for read/write set updates (reuse_handler=false).
3. Follow InnoDB's solution - InnoDB delays actually initializing its template until index_read is called for the 2nd time (relying on `prebuilt->sql_stat_start`), and during index_read `QUICK_RANGE_SELECT::column_bitmap` is already fixed up and the table read/write set is switched to it, so the new template would be built correctly.

In order to make it easier to maintain and port, after discussing with Manuel, I'm going with a simplified version of #3 that delays decoder creation until the first read operation (index_*, rnd_*, range_read_*, multi_range_read_*), and setting the delay flag in index_init / rnd_init / multi_range_read_init.
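
A minimal sketch of this delayed-decoder approach; the flag and helper names (m_need_build_decoder, setup_read_decoders) are assumptions:

```
// Hedged sketch: defer decoder creation from *_init() to the first read,
// because index merge may fix up the column bitmap after init/reset.
int ha_rocksdb::index_init(uint idx, bool sorted) {
  m_need_build_decoder = true;  // read_set may still change before first read
  // ... existing init work ...
  return 0;
}

int ha_rocksdb::index_read_map(uchar *buf, const uchar *key,
                               key_part_map keypart_map,
                               enum ha_rkey_function find_flag) {
  if (m_need_build_decoder) {
    setup_read_decoders();      // read_set is final by the first read
    m_need_build_decoder = false;
  }
  // ... existing read path ...
  return 0;
}
```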

Also, I ran into a bug with truncation_partition where Rdb_converter's tbl_def is stale (we only update ha_rocksdb::m_tbl_def), but it is fine because it is not being used after table open. But my change moves the lookup_bitmap initialization into Rdb_converter which takes a dependency on Rdb_converter::m_tbl_def so now we need to reset it properly.

Reviewed By: lth

Differential Revision: D25521518

fbshipit-source-id: a46ed3e71fa
ldonoso pushed a commit that referenced this pull request Jul 8, 2021
Upstream commit ID : fb-mysql-5.6.35/e025cf1c47e63aada985d78e4083f2e02fba434f
PS-7731 : Merge percona-202102

Summary:
Today in `SELECT count(*)` MyRocks would still decode every single column due to this check, despite the readset being empty:

```
 // bitmap is cleared on index merge, but it still needs to decode columns
    bool field_requested =
        decode_all_fields || m_verify_row_debug_checksums ||
        bitmap_is_set(field_map, m_table->field[i]->field_index);
```
As a result MyRocks is significantly slower than InnoDB in this particular scenario.

Turns out in index merge, when it tries to reset, it calls ha_index_init with an empty column_bitmap, so our field decoders didn't know they needed to decode anything, so the entire query would return nothing. This is discussed in [this commit](facebook/mysql-5.6@70f2bcd), and [issue 624](facebook/mysql-5.6#624) and [PR 626](facebook/mysql-5.6#626). So the workaround we had at that time was to simply treat an empty map as implicitly meaning everything, and the side effect is a massively slowed-down count(*).

We have a few options to address this:
1. Fix index merge optimizer - looking at the code in QUICK_RANGE_SELECT::init_ror_merged_scan, it actually fixes up the column_bitmap properly, but after init/reset, so the fix would simply be moving the bitmap set code up. For secondary keys, prepare_for_position will automatically call `mark_columns_used_by_index_no_reset(s->primary_key, read_set)` if HA_PRIMARY_KEY_REQUIRED_FOR_POSITION is set (true for both InnoDB and MyRocks), so we would know correctly that we need to unpack PK when walking SK during index merge.
2. Overriding `column_bitmaps_signal` and setting up decoders whenever the bitmap changes - however, this doesn't work by itself. Because no storage engine today actually uses handler::column_bitmaps_signal, this path hasn't been tested properly in index merge. In this case, QUICK_RANGE_SELECT::init_ror_merged_scan should call set_column_bitmaps_no_signal to avoid resetting the correct read/write set of head, since head is used as the first handler (reuses_handler=true) and as subsequent placeholders for read/write set updates (reuse_handler=false).
3. Follow InnoDB's solution - InnoDB delays actually initializing its template until index_read is called for the 2nd time (relying on `prebuilt->sql_stat_start`), and during index_read `QUICK_RANGE_SELECT::column_bitmap` is already fixed up and the table read/write set is switched to it, so the new template would be built correctly.

In order to make it easier to maintain and port, after discussing with Manuel, I'm going with a simplified version of #3 that delays decoder creation until the first read operation (index_*, rnd_*, range_read_*, multi_range_read_*), and setting the delay flag in index_init / rnd_init / multi_range_read_init.

Also, I ran into a bug with truncation_partition where Rdb_converter's tbl_def is stale (we only update ha_rocksdb::m_tbl_def), but it is fine because it is not being used after table open. But my change moves the lookup_bitmap initialization into Rdb_converter which takes a dependency on Rdb_converter::m_tbl_def so now we need to reset it properly.

Reference Patch: facebook/mysql-5.6@44d6a8d

---------
Porting Note: Due to 8.0's new counting infra (handler::record & handler::record_with_index), this only helps PK counting. Will send out a better fix that works better with 8.0 new counting infra.

Reviewed By: Pushapgl

Differential Revision: D26265470

fbshipit-source-id: f142be681ab
ldonoso pushed a commit that referenced this pull request Nov 4, 2021
Move misplaced comment verbatim

Change-Id: Iec47fcbdc5f145e8131dd24b3c34a66258ef74fd
ldonoso pushed a commit that referenced this pull request Nov 4, 2021
Patch #3: Multi-value indexes are not used for MEMBER OF combined with OR.

If we have a MEMBER OF predicate that is combined with other predicates
with an OR operator, the range optimizer will not consider MEMBER OF
as a candidate for using an index.

Fixed by adding a case construct for Item_func::MEMBER_OF_FUNC in
get_func_mm_tree().
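
A minimal sketch of the added case, assuming the surrounding structure of get_func_mm_tree(); only the Item_func::MEMBER_OF_FUNC label comes from this commit:

```
// Hedged sketch: treat MEMBER OF like JSON_OVERLAPS()/JSON_CONTAINS() so the
// range optimizer considers a multi-valued index range scan for it.
switch (cond_func->functype()) {
  // ... existing cases, e.g. Item_func::JSON_OVERLAPS_FUNC ...
  case Item_func::MEMBER_OF_FUNC:
    tree = get_full_func_mm_tree(thd, param, cond_func, field_item, value,
                                 inv);
    break;
  // ...
}
```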

json_overlaps() and json_contains() already have this type of support.

This is a contribution by Yubao Liu.

Change-Id: I1fd7a78091998437310973b3c24099ad554a58a6
ldonoso pushed a commit that referenced this pull request Jan 19, 2022
…close]

Make the range optimizer return AccessPaths instead of TABLE_READ_PLAN.
This is the first step of getting rid of TABLE_READ_PLAN and moving
everything into AccessPath; currently, it's just a very thin shell:

 1. TRPs are still used internally, and AccessPath is created
    at the very end.
 2. Child TRPs are still child TRPs (ie., there are no child
    AccessPaths).
 3. All returned AccessPaths are still of the type INDEX_RANGE_SCAN,
    wrapping a TRP.
 4. Some callers still reach directly into the TRP, assuming #3.

Most callers (save for the aforementioned #4) use a set of simple wrapper
functions to access TRP-derived properties from AccessPaths; as we
continue the transformation, this is the main place we'll change the
interaction (i.e., most of the calling code will remain unchanged).

Change-Id: I3d9dc9e33c53d1e5124ea9c47b7d6d9270cd1906
ldonoso pushed a commit that referenced this pull request Jan 19, 2022
This error happens for queries such as:

SELECT ( SELECT 1 FROM t1 ) AS a,
  ( SELECT a FROM ( SELECT x FROM t1 ORDER BY a ) AS d1 );

Query_block::prepare() for query block #4 (corresponding to the 4th
SELECT in the query above) calls setup_order() which again calls
find_order_in_list(). That function replaces an Item_ident for 'a' in
Query_block.order_list with an Item_ref pointing to query block #2.
Then Query_block::merge_derived() merges query block #4 into query
block #3. The Item_ref mentioned above is then moved to the order_list
of query block #3.

In the next step, find_order_in_list() is called for query block #3.
At this point, 'a' in the select list has been resolved to another
Item_ref, also pointing to query block #2. find_order_in_list()
detects that the Item_ref in the order_list is equivalent to the
Item_ref in the select list, and therefore decides to replace the
former with the latter. Then find_order_in_list() calls
Item::clean_up_after_removal() recursively (via Item::walk()) for the
order_list Item_ref (since that is no longer needed).

When calling clean_up_after_removal(), no
Cleanup_after_removal_context object is passed. This is the actual
error, as there should be a context pointing to query block #3 that
ensures that clean_up_after_removal() only purges Item_subselect.unit
if both of the following conditions hold:

1) The Item_subselect should not be in any of the Item trees in the
   select list of query block #3.

2) Item_subselect.unit should be a descendant of query block #3.

These conditions ensure that we only purge Item_subselect.unit if we
are sure that it is not needed elsewhere. But without the right
context, query block #2 gets purged even if it is used in the select
lists of query blocks #1 and #3.

The fix is to pass a context (for query block #3) to clean_up_after_removal().
Both of the above conditions then become false, and Item_subselect.unit is
not purged. As an additional shortcut, find_order_in_list() will not call
clean_up_after_removal() if real_item() of the order item and the select
list item are identical.

In addition, this commit changes clean_up_after_removal() so that it
requires the context to be non-null, to prevent similar errors. It
also simplifies Item_sum::clean_up_after_removal() by removing window
functions unconditionally (and adds a corresponding test case).

Change-Id: I449be15d369dba97b23900d1a9742e9f6bad4355
ldonoso pushed a commit that referenced this pull request Mar 15, 2022
…ercona#871)

Summary:
Original report: https://jira.mariadb.org/browse/MDEV-15816

To reproduce this bug, just follow the steps below:

client 1:
USE test;
CREATE TABLE t1 (i INT) ENGINE=MyISAM;
HANDLER t1 OPEN h;
CREATE TABLE t2 (i INT) ENGINE=RocksDB;
LOCK TABLES t2 WRITE;

client 2:
FLUSH TABLES WITH READ LOCK;

client 1:
INSERT INTO t2 VALUES (1);

So client 1 acquired the lock and set m_lock_rows = RDB_LOCK_WRITE.
Then client 2 calls store_lock(TL_IGNORE) and m_lock_rows is wrongly
set to RDB_LOCK_NONE, as below:

```
 #0  myrocks::ha_rocksdb::store_lock (this=0x7fffbc03c7c8, thd=0x7fffc0000ba0, to=0x7fffc0011220, lock_type=TL_IGNORE)
 #1  get_lock_data (thd=0x7fffc0000ba0, table_ptr=0x7fffe84b7d20, count=1, flags=2)
 #2  mysql_lock_abort_for_thread (thd=0x7fffc0000ba0, table=0x7fffbc03bbc0)
 #3  THD::notify_shared_lock (this=0x7fffc0000ba0, ctx_in_use=0x7fffbc000bd8, needs_thr_lock_abort=true)
 #4  MDL_lock::notify_conflicting_locks (this=0x555557a82380, ctx=0x7fffc0000cc8)
 #5  MDL_context::acquire_lock (this=0x7fffc0000cc8, mdl_request=0x7fffe84b8350, lock_wait_timeout=2)
 #6  Global_read_lock::lock_global_read_lock (this=0x7fffc0003fe0, thd=0x7fffc0000ba0)
```

Finally, client 1's "INSERT INTO..." hits the assertion failure
'm_lock_rows == RDB_LOCK_WRITE' in myrocks::ha_rocksdb::write_row().

Fix this bug by not setting m_lock_rows if lock_type == TL_IGNORE.
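
A minimal sketch of the guard, assuming the surrounding store_lock() shape; only the TL_IGNORE condition and m_lock_rows come from this commit:

```
// Hedged sketch: TL_IGNORE comes from other threads probing our locks
// (e.g. via mysql_lock_abort_for_thread) and must not clobber m_lock_rows.
THR_LOCK_DATA **ha_rocksdb::store_lock(THD *const thd, THR_LOCK_DATA **to,
                                       enum thr_lock_type lock_type) {
  if (lock_type != TL_IGNORE) {
    m_lock_rows = RDB_LOCK_NONE;
    // ... derive RDB_LOCK_WRITE / RDB_LOCK_READ from the real lock_type ...
  }
  // ... existing lock bookkeeping ...
  return to;
}
```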

Closes facebook/mysql-5.6#838
Pull Request resolved: facebook/mysql-5.6#871

Differential Revision: D9417382

Pulled By: lth
ldonoso pushed a commit that referenced this pull request Mar 15, 2022
Summary:
Currently, during primary key value encoding, the ttl value can come from
one of these 3 cases:
1. ttl column in primary key
2. non-ttl column
   a. old record (update case)
   b. current timestamp
3. ttl column in non-key field

Workflow #1: first, in Rdb_key_def::pack_record(), find and
store pk_offset; then, during value encoding, try to parse the key
slice to fetch the ttl value using pk_offset.

Workflow #3: fetch the ttl value from the ttl column

The change is to merge #1 and #3 by always fetching the TTL value from the
ttl column, no matter whether the ttl column is in the primary key or not.
Of course, remove pk_offset, since it isn't used anymore.

BTW, for secondary keys, the ttl value always comes from m_ttl_bytes, which
is stored during primary value encoding.

Differential Revision: D14662716
ldonoso pushed a commit that referenced this pull request Mar 15, 2022
Summary:
1. Remove reference to rpc_protocol as we don't have detached session yet
2. Remove reference to information_schema.processlist.srv_id as the field is gone in 8.0
3. mysql_restart.inc now outputs # restart
4. SELECT count(*) from a secondary index without a primary clustering index has different behavior in 8.0. It returns 0 in 8.0, which does a table scan over the primary key. I've added a >0 to force an index scan.
5. global_status/session-status is moved from information_schema to performance_schema
6. hermitage.result is rebaselined as we don't have the snapshot conflict error message patch yet. Once it is ported we can rebaseline it (it's better to have more test coverage this way)
7. mysql.event table is gone (in favor of DD) - use real CREATE EVENT in misc.test
8. Rebaseline MyRocks error code for the error code shift. We have a task for looking into this later.
9. Partially port can_hold_read_locks_on_select / can_hold_locks_on_trans to pass more locking related tests.

Reviewed By: lloyd

Differential Revision: D17622863
ldonoso pushed a commit that referenced this pull request Mar 15, 2022
Summary:
This is an upstream bug: https://bugs.mysql.com/bug.php?id=92964

The issue is that if a slave instance uses sysvar_session_track_gtids == OWN_GTID and also enables MTS, the slave will lag behind the master and its performance degrades over time.

The reason for this lag/perf degradation is that each slave worker keeps adding its own gtid to its own Session_consistency_gtids_ctx.m_gtid_set during each transaction commit. Over time, some Session_consistency_gtids_ctx.m_gtid_set instances may contain millions of gno_intervals, and it will take a long time to insert 1 gtid into Session_consistency_gtids_ctx.m_gtid_set (time complexity O(n)); see [Gtid_set::add_gno_interval](https://github.com/facebook/mysql-5.6/blob/fb-mysql-8.0.13/sql/rpl_gtid_set.cc#L323).

There are a couple of code paths related to this issue:
- During slave worker thread start, there is a [check](https://github.com/facebook/mysql-5.6/blob/fb-mysql-8.0.13/sql/session_tracker.cc#L1560) to not enable the tracker for system threads, but that check doesn't work---the THD system_thread field still holds its default value (NON_SYSTEM_THREAD) during THD::THD.
```
  m_enabled = thd->variables.session_track_gtids != OFF &&
              /* No need to track GTIDs for system threads. */
              thd->system_thread == NON_SYSTEM_THREAD;
```
- In [Session_gtids_track::Update](https://github.com/facebook/mysql-5.6/blob/fb-mysql-8.0.13/sql/session_tracker.cc#L1581), the else branch doesn't unregister the listener, because mysql/mysql-server@1e0844c tried to fix another issue.

Solutions:
1. [Current] Add a system-thread check during data collection, since THD will be fully initialized at that time (sketched below).
   - Pro: simple change (easy to port) and perf is ok
2. Add `thd->session_track_gtids=OFF; thd->session_tracker.get_tracker(SESSION_GTIDS_TRACKER)->update(thd);` after [init_slave_thread()](https://github.com/facebook/mysql-5.6/blob/fb-mysql-8.0.13/sql/rpl_slave.cc#L6788)
   - Pro: during init_slave_thread(), thd->system_thread will be set to the correct value.
   - Con: it needs to be added in a couple of places, such as handle_slave_io, handle_slave_sql, handle_slave_worker
3. Add `thd->session_track_gtids=OFF;thd->session_tracker.get_tracker(SESSION_GTIDS_TRACKER)->update(thd);` inside [init_slave_thread()](https://github.com/facebook/mysql-5.6/blob/fb-mysql-8.0.13/sql/rpl_slave.cc#L6788)
   - Pro: only added once for the slave/slave worker/IO threads
   - Con: the [should_collect](https://github.com/facebook/mysql-5.6/blob/fb-mysql-8.0.13/sql/rpl_context.cc#L53) check still returns true, thus it is less performant than #1
4. Add `thd->session_track_gtids=OFF;thd->rpl_thd_ctx.session_gtids_ctx().unregister_ctx_change_listener(thd->session_tracker.get_tracker(SESSION_GTIDS_TRACKER));` inside init_slave_thread()
   - Pro: perf is better than #3, on par with #1
   - Con: if there are other non-system threads (besides the IO/SQL threads) doing transactions, those threads will still collect their own gtids.
5. Add another THD constructor overload which takes 'enum_thread_type' as a parameter whose default value is NON_SYSTEM_THREAD
   - Pro: it will initialize the correct value with THD and won't enable those session trackers
   - Con: similar to #2/#4, THD() needs to be replaced with THD(enum_thread_type) in a couple of places.
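
A minimal sketch of solution 1 in C++, mirroring the condition quoted above; the class and member names are assumptions based on the files cited:

```
// Hedged sketch of the chosen fix: repeat the check at data-collection
// time, when the THD is fully initialized, unlike at tracker-enable time
// in THD::THD.
bool Session_consistency_gtids_ctx::shall_collect(const THD *thd) {
  return thd->variables.session_track_gtids != OFF &&
         /* system threads never need GTID tracking */
         thd->system_thread == NON_SYSTEM_THREAD;
}
```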

Differential Revision: D18072672
ldonoso pushed a commit that referenced this pull request Mar 15, 2022
Summary:
This is an upstream bug: https://bugs.mysql.com/bug.php?id=92964

The issue is that if a slave instance uses sysvar_session_track_gtids == OWN_GTID and also enables MTS, the slave will lag behind the master and its performance degrades over time.

The reason for this lag/perf degradation is that each slave worker keeps adding its own gtid to its own Session_consistency_gtids_ctx.m_gtid_set during each transaction commit. Over time, some Session_consistency_gtids_ctx.m_gtid_set instances may contain millions of gno_intervals, and it will take a long time to insert 1 gtid into Session_consistency_gtids_ctx.m_gtid_set (time complexity O(n)); see [Gtid_set::add_gno_interval](https://github.com/facebook/mysql-5.6/blob/fb-mysql-8.0.13/sql/rpl_gtid_set.cc#L323).

There are a couple of code paths related to this issue:
- During slave worker thread start, there is a [check](https://github.com/facebook/mysql-5.6/blob/fb-mysql-8.0.13/sql/session_tracker.cc#L1560) to not enable the tracker for system threads, but that check doesn't work---the THD system_thread field still holds its default value (NON_SYSTEM_THREAD) during THD::THD.
```
  m_enabled = thd->variables.session_track_gtids != OFF &&
              /* No need to track GTIDs for system threads. */
              thd->system_thread == NON_SYSTEM_THREAD;
```
- In [Session_gtids_track::Update](https://github.com/facebook/mysql-5.6/blob/fb-mysql-8.0.13/sql/session_tracker.cc#L1581), the else branch doesn't unregister the listener, because mysql/mysql-server@1e0844c tried to fix another issue.

Solutions:
1. [Current] Add a system-thread check during data collection, since THD will be fully initialized at that time.
   - Pro: simple change (easy to port) and perf is ok
2. Add `thd->session_track_gtids=OFF; thd->session_tracker.get_tracker(SESSION_GTIDS_TRACKER)->update(thd);` after [init_slave_thread()](https://github.com/facebook/mysql-5.6/blob/fb-mysql-8.0.13/sql/rpl_slave.cc#L6788)
   - Pro: during init_slave_thread(), thd->system_thread will be set to the correct value.
   - Con: it needs to be added in a couple of places, such as handle_slave_io, handle_slave_sql, handle_slave_worker
3. Add `thd->session_track_gtids=OFF;thd->session_tracker.get_tracker(SESSION_GTIDS_TRACKER)->update(thd);` inside [init_slave_thread()](https://github.com/facebook/mysql-5.6/blob/fb-mysql-8.0.13/sql/rpl_slave.cc#L6788)
   - Pro: only added once for the slave/slave worker/IO threads
   - Con: the [should_collect](https://github.com/facebook/mysql-5.6/blob/fb-mysql-8.0.13/sql/rpl_context.cc#L53) check still returns true, thus it is less performant than #1
4. Add `thd->session_track_gtids=OFF;thd->rpl_thd_ctx.session_gtids_ctx().unregister_ctx_change_listener(thd->session_tracker.get_tracker(SESSION_GTIDS_TRACKER));` inside init_slave_thread()
   - Pro: perf is better than #3, on par with #1
   - Con: if there are other non-system threads (besides the IO/SQL threads) doing transactions, those threads will still collect their own gtids.
5. Add another THD constructor overload which takes 'enum_thread_type' as a parameter whose default value is NON_SYSTEM_THREAD
   - Pro: it will initialize the correct value with THD and won't enable those session trackers
   - Con: similar to #2/#4, THD() needs to be replaced with THD(enum_thread_type) in a couple of places.

Differential Revision: D18072672
ldonoso pushed a commit that referenced this pull request Mar 18, 2022
Summary:
This is an upstream bug: https://bugs.mysql.com/bug.php?id=92964

The issue is that if a slave instance uses sysvar_session_track_gtids == OWN_GTID and also enables MTS, the slave will lag behind the master and its performance degrades over time.

The reason for this lag/perf degradation is that each slave worker keeps adding its own gtid to its own Session_consistency_gtids_ctx.m_gtid_set during each transaction commit. Over time, some Session_consistency_gtids_ctx.m_gtid_set instances may contain millions of gno_intervals, and it will take a long time to insert 1 gtid into Session_consistency_gtids_ctx.m_gtid_set (time complexity O(n)); see [Gtid_set::add_gno_interval](https://github.com/facebook/mysql-5.6/blob/fb-mysql-8.0.13/sql/rpl_gtid_set.cc#L323).

There are a couple of code paths related to this issue:
- During slave worker thread start, there is a [check](https://github.com/facebook/mysql-5.6/blob/fb-mysql-8.0.13/sql/session_tracker.cc#L1560) to not enable the tracker for system threads, but that check doesn't work---the THD system_thread field still holds its default value (NON_SYSTEM_THREAD) during THD::THD.
```
  m_enabled = thd->variables.session_track_gtids != OFF &&
              /* No need to track GTIDs for system threads. */
              thd->system_thread == NON_SYSTEM_THREAD;
```
- In [Session_gtids_track::Update](https://github.com/facebook/mysql-5.6/blob/fb-mysql-8.0.13/sql/session_tracker.cc#L1581), the else branch doesn't unregister the listener, because mysql/mysql-server@1e0844c tried to fix another issue.

Solutions:
1. [Current] Add a system-thread check during data collection, since THD will be fully initialized at that time.
   - Pro: simple change (easy to port) and perf is ok
2. Add `thd->session_track_gtids=OFF; thd->session_tracker.get_tracker(SESSION_GTIDS_TRACKER)->update(thd);` after [init_slave_thread()](https://github.com/facebook/mysql-5.6/blob/fb-mysql-8.0.13/sql/rpl_slave.cc#L6788)
   - Pro: during init_slave_thread(), thd->system_thread will be set to the correct value.
   - Con: it needs to be added in a couple of places, such as handle_slave_io, handle_slave_sql, handle_slave_worker
3. Add `thd->session_track_gtids=OFF;thd->session_tracker.get_tracker(SESSION_GTIDS_TRACKER)->update(thd);` inside [init_slave_thread()](https://github.com/facebook/mysql-5.6/blob/fb-mysql-8.0.13/sql/rpl_slave.cc#L6788)
   - Pro: only added once for the slave/slave worker/IO threads
   - Con: the [should_collect](https://github.com/facebook/mysql-5.6/blob/fb-mysql-8.0.13/sql/rpl_context.cc#L53) check still returns true, thus it is less performant than #1
4. Add `thd->session_track_gtids=OFF;thd->rpl_thd_ctx.session_gtids_ctx().unregister_ctx_change_listener(thd->session_tracker.get_tracker(SESSION_GTIDS_TRACKER));` inside init_slave_thread()
   - Pro: perf is better than #3, on par with #1
   - Con: if there are other non-system threads (besides the IO/SQL threads) doing transactions, those threads will still collect their own gtids.
5. Add another THD constructor overload which takes 'enum_thread_type' as a parameter whose default value is NON_SYSTEM_THREAD
   - Pro: it will initialize the correct value with THD and won't enable those session trackers
   - Con: similar to #2/#4, THD() needs to be replaced with THD(enum_thread_type) in a couple of places.

Differential Revision: D18072672
ldonoso pushed a commit that referenced this pull request Mar 22, 2022
MySQL Bypass Project Initial Change

Summary:
MySQL bypass infrastructure change

1. Adds a new hton override `handlerton::handle_single_table_select` to allow hijacking of SELECT statement execution from any storage engine.
2. Add lex parsing for bypass hints in optimizer hint style /*+ bypass */ and /*+ no_bypass */
3. MyRocks overrides handle_single_table_select but does nothing and simply falls back (see the sketch below).
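
A minimal sketch of the hook in C++; only the member name handle_single_table_select comes from the commit text, the signature and return convention are assumptions:

```
struct handlerton {
  // ... existing handlerton members ...

  // Hedged sketch: return true if the engine fully executed the SELECT,
  // false to fall back to the regular server execution path.
  bool (*handle_single_table_select)(THD *thd, SELECT_LEX *select_lex);
};

// MyRocks initially installs a no-op that always falls back:
static bool rocksdb_handle_single_table_select(THD *, SELECT_LEX *) {
  return false;  // not handled; the server executes the SELECT normally
}
```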

This is a port from the bypass feature branch - change #1 of 3. I'm planning to port the changes in the following order:

1. MySQL lexer + SELECT hijacking (this one).
2. SELECT Statement parsing and checking
3. Execution of SELECT statement in bypass

Reference Patch: facebook/mysql-5.6@2d19dc0

-----
Porting Notes:

* The entry points have changed due to MySQL 8.0 refactoring to SELECT path, and now it is in `Sql_cmd_dml::execute_inner`. I'll also see if overriding the entire execute would be better but this should be good enough for now.
* The check for whether we should go to bypass handler is mostly the same, except st_select_lex_t is now SELECT_LEX.
* The lex parsing for hints was mostly ported as-is. I was also debating whether I should just use optimizer hints instead of hacking the lexer, but decided against using optimizer hint for now so that migration from 5.6 will be easier. We can migrate to optimizer hint once 8.0 migration is complete.

Differential Revision: D22807040

-----------------------------------------------------------------------------------

MySQL bypass change #2 - SELECT parsing

Summary:
This is change #2 that ports the SELECT statement parsing from feature-myrocks-bypass branch, including:

1. select_parser class that parses SELECT statements and validates that the scenarios are supported, erroring with a proper message if not supported
2. You can control bypass policy with rocksdb_select_bypass_policy variable = 0 (ALWAYS OFF) / 1 (ALWAYS ON) / 2 (OPT IN) / 3 (OPT OUT).
3. A few metrics for number of SELECT bypass statements executed/rejected (due to not supported) /failed (something wrong in execution)

At this point this only dumps the SELECT statement parsed results without doing anything useful. The next change will be porting the query plan execution and point/range query execution part.

Please ignore the unused attributes for now - they'll be removed in the next change once the execution part is ported.

Reference Patch:

facebook/mysql-5.6@e79b20a
facebook/mysql-5.6@8072e3d

 ---
Porting Notes:

There are a lot of small changes in this area but nothing too bad:
* Some bypass scenarios no longer fail, or fail with a different error message, because of the new parse tree -> AST conversion process. For example, `SELECT *` now works correctly
* st_select_lex_t -> SELECT_LEX, and many members have changed names, such as where -> where_cond()
* protocol usage has changed to start_row, store, end_row in 8.0, which is much better.
* thd->query() is now a LEX_STRING
* The SELECT option check (such as SELECT_DISTINCT) now uses active_options() and only checks for SELECT_DISTINCT, as there seems to be a default value being set, most likely as a side effect of more parse tree processing being done. I checked the flags and it looks like only SELECT_DISTINCT is interesting and needs to be blocked.
* SELECT INTO OUTFILE check is changed to use results->needs_file_priviledge
* True/False const are now represented as Item_func_true/Item_func_false. For now I'm over-generalizing to functions that are const, and expect val_int/etc would do the right thing

Reviewed By: luqun

Differential Revision: D22808305

-----------------------------------------------------------------------------------

MySQL Bypass Change #3 - Query Execution

Summary:
MySQL bypass execution of select statements. All the query plan / execution are in `select_exec` class.

When creating the query plan, the following two functions are the most important:

1. scan_value determines whether to unpack index, value, or both
2. scan_where determines which part of WHERE are indexes and builds the key with it, in index order. And the rest are filters used to include/skip values.

When it comes to unpacking - we have two code path:

1. The unpack_fast* functions are the fast path that unpack directly into the text buffer m_row_buf and a mapping data structure m_send_mapping that maps from SELECT item order to a (buf, len) tuple pointing to the text, later to be sent out in SELECT item order using protocol::store_string_data (I had to expose this internal function and rename it, but I'm open for suggestions. The string overload checks the type of the argument, so it asserts when you call store(string) for an int field, etc. But I'm wondering if this actually exposes a bug that the client would see as an incorrect type and our tools won't catch. TODO: Validate this, but this is not necessarily blocking). Note in this case we don't support filtering, which is to be added in the future for simple cases.

2. The slower path unpack* functions will unpack first into record[0], use eval_cond to do filtering using MySQL Item::val_int() to evaluate the conditional expressions. When sending rows, use protocol::send_result_set_row to send the results using select_lex->items pointing to record[0].

The rest should be self-explanatory.

--------
Porting Notes:

Removed unpack_fast_path as testing showed it didn't have much impact on the performance benchmarks. It also didn't work in 8.0 because we've made some format changes to VARCHAR/BLOBs. This greatly simplifies the bypass feature and makes it work much better with future format changes.

Reviewed By: luqun

Differential Revision: D22818216

-----------------------------------------------------------------------------------

Fix bypass bug in prefix decoding optimization

Summary:
This is a classic stale pointer bug discovered in data-correctness testing; we were unpacking int fields to garbage as a result.

In bypass we try to optimize prefix decoding for a case like `SELECT d, c, b, a FROM t WHERE a=1 and b=2` with the index being `(a, b, c, d)`. We unpack a, b and remember the location the first time, and in future unpacking we can simply skip ahead to c, d. The problem is that we remember the location of a/b in m_send_mapping with a pointer into m_row_buf (for unpacking), but m_row_buf may be reallocated, leading to stale pointers in m_send_mapping.

The fix is to use offsets rather than pointers.
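
A minimal sketch of the offset-based fix; the struct and helper names are hypothetical:

```
#include <cstddef>
#include <string>

// Hedged sketch: an offset survives reallocation of the row buffer,
// whereas a raw pointer into it dangles.
struct Send_field_entry {
  // const char *ptr;  // BAD: dangles once m_row_buf reallocates
  std::size_t offset;  // GOOD: resolved lazily against the live buffer
  std::size_t len;
};

inline const char *field_data(const std::string &row_buf,
                              const Send_field_entry &e) {
  return row_buf.data() + e.offset;
}
```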

Also updated the test case. Unfortunately it's hard to make sure the test actually did crash before the fix (usually it just prints out old values that are still there) - I did manually verify in debugging that it did trigger the condition.

Reference Patch: facebook/mysql-5.6@e24b0bf454c

 ---
Porting Notes:

The fix no longer applies after removing fast unpacking code. Only keeping the tests.

Reviewed By: luqun

Differential Revision: D22854608

-----------------------------------------------------------------------------------

Fix bypass bug in unpack info

Summary:
The data-correctness tool found another issue with primary key unpacking: it incorrectly treats the 0x3 value as unpacking info because we were doing the wrong unpack info check.

The way unpack info is detected for primary keys and secondary keys is different: for primary keys we check all the keys in setup_field_encoder, while for secondary keys we check the unpack tag (which should always be there). In bypass we were not distinguishing between primary key / secondary key unpack info decoding (and instead always did both, which is incorrect). This fixes that and adds a test.

Also, decode_value_header is only intended for unpacking the pk and never secondary keys, because it uses m_maybe_unpack_info, which is only intended for the primary key scenario. I've renamed the function to explicitly indicate it is for the pk scenario only.

Reference Patch: facebook/mysql-5.6@f1664c144d1

 ---
Porting Notes:

Some fast unpacking related code are removed.

Reviewed By: luqun

Differential Revision: D22856130

-----------------------------------------------------------------------------------

Use enum for mysql_select_bypass_policy to make it more user friendly

Summary:
We should be using an enum instead of integer values for the policy, as enum values are more user friendly, and you can use both the integer values and the string names. The new names are "always_on", "always_off", "opt_in", "opt_out". (I started with "off" and "force" but then realized FORCE is a keyword.)
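
A minimal sketch of the enum variable using the standard plugin-variable API; the value names come from the commit text, the declaration details are assumptions:

```
// Hedged sketch of an enum-typed system variable for the bypass policy.
static const char *select_bypass_policy_names[] = {
    "always_off", "always_on", "opt_in", "opt_out", NullS};

static TYPELIB select_bypass_policy_typelib = {
    array_elements(select_bypass_policy_names) - 1, "",
    select_bypass_policy_names, nullptr};

static ulong rocksdb_select_bypass_policy = 0;  // always_off

static MYSQL_SYSVAR_ENUM(select_bypass_policy, rocksdb_select_bypass_policy,
                         PLUGIN_VAR_RQCMDARG,
                         "SELECT bypass policy",
                         nullptr /* check */, nullptr /* update */,
                         0 /* default: always_off */,
                         &select_bypass_policy_typelib);
```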

Reference Patch: facebook/mysql-5.6@698d945e7a3

Reviewed By: luqun

Differential Revision: D22856678

-----------------------------------------------------------------------------------

Reporting rejected bypass queries into I_S

Summary: Reporting rejected bypass queries into information_schema instead of error log

Differential Revision: D24106900

-----------------------------------------------------------------------------------

Bypass rejected query normalization

Summary:
For protecting the privacy of clients, we need to hide the detailed query information while recording a rejected bypass query. To do that, we digest/normalize each rejected bypass query before writing it into the RDB_I_S (see the hypothetical illustration below). A 'digested query' is a textual representation of a query, where:
  - comments are removed,
  - non significant spaces are removed,
  - literal values are replaced with a special '?' marker,
  - lists of values are collapsed using a shorter notation
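
A hypothetical illustration of the normalization (not actual tool output):

```
// original: SELECT a, b FROM t1 WHERE id = 42 AND x IN (1, 2, 3) /* app */
// digested: SELECT a, b FROM t1 WHERE id = ? AND x IN (...)
```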

Differential Revision: D24155757

-----------------------------------------------------------------------------------

Fix rows_read and queries_range stats in bypass

Summary:
Bypass updated rows_sent and rows_examined but didn't update rows_read. This fixes that and adds tests. We also get rocksdb_queries_point and rocksdb_queries_range for free, as we are using the MyRocks API to access data. Note I didn't include rows_examined in the test because querying unrelated information_schema tables will affect rows_examined and it is kinda flaky, depending on many factors like perf schema. I did find a double-counting issue with rocksdb_queries_range and fixed it as well.

Reference Patch: facebook/mysql-5.6@6106eba

----
Porting Notes: Removed updates to table stats which is not ported to 8.0.

Differential Revision: D25809584

-----------------------------------------------------------------------------------

Fix bypass range query

Summary:
In the current bypass implementation, the range query implementation is not correct. For A >= begin and A <= end, it'll only pick begin *or* end as the start of the range, and it will only use the prefix as the end of the range, which means it can potentially scan a lot more rows than needed. It relies on the conditions A >= begin / A <= end to filter out the unnecessary rows, so the end result is correct (that's why this issue was not discovered in testing).

The correct way to implement this is to set up the (begin, end) range slices correctly, and NOT rely on any conditional evaluation to filter rows. The tricky part is determining the correct begin/end based on a few factors:
* forward / reverse column family
* ascending / descending orders
* inclusive vs non-inclusive ranges (>=, <= vs <, >)
* prefix query vs range query

For better testing, I've done the following:
* verify_bypass_query.inc, which will run the query in bypass and non-bypass mode and verify the row reads are the same as or fewer than in the non-bypass case, to ensure such an issue won't happen again. Most bypass query validations were moved to use verify_bypass_query.inc. We can also add more validation in the future in verify_bypass_query.inc as needed. As a bonus, it makes verifying that the results are the same in the bypass/non-bypass cases easy as well.
* move range key validation from bypass_select_scenarios into bypass_select_range_pk and bypass_select_range_sk
* added more test cases to bypass_select_range_pk / bypass_select_range_sk, especially for PK case where some scenarios are missing
* dump query results to file and verify query results didn't change w/ and w/o bypass. I had to back-port `output` mysqltest support to make this work properly (for some reason --exec $MYSQL doesn't like multiline queries).

For review, there is no need to go through results file as there are validation in-place to make sure query return same results w/ and w/o bypass.

Reference Patch: facebook/mysql-5.6@ef9a677

------------
Porting Notes:

* Disabled a TEXT/BLOB prefix key scenario that would require one of the later commits to fix
* ubsan reported a "bug" with memcmp(NULL, NULL, 0) with Rdb_string_writer comparison. The fix has been back ported to 5.6.
* Tests caught a bug where THD::inc_sent_row_count now increments a status var, resulting in double counting

Reviewed By: Pushapgl

Differential Revision: D26564619

-----------------------------------------------------------------------------------

Properly fallback if executing unsupported case

Summary:
In most cases the bypass parser `select_parser` should identify unsupported cases and fall back to the regular query path. However, there are a few checks that are expensive and are done as part of execution, and we should properly fall back in those cases as well. In a future version maybe we should refactor it so that those are another phase of parsing as well, which would also help bypass RPC.

Reference Patch: facebook/mysql-5.6@5452df8

Reviewed By: luqun

Differential Revision: D26564781

-----------------------------------------------------------------------------------

Add switch to disallow filters

Summary:
For scenarios like A >= 1 and B >= 2, doing it correctly requires skip scan, which isn't supported yet. Today we would just use A >= 1 as the starting point and use B >= 2 as a filter to discard unwanted rows, which isn't efficient. There are other cases too, such as a range query using force index A but using a key from index B. Either way, our production scenarios for bypass don't include these cases, but it would be good to add a switch to disallow such non-optimal cases and simply fall back to a regular range query for such corner cases.

This diff adds the `rocksdb-select-bypass-allow-filters` switch to allow/block filters. It is enabled by default to make tests happy but will be disabled in prod.

Reference Patch: facebook/mysql-5.6@7cdb33a

Reviewed By: Pushapgl

Differential Revision: D26566608

-----------------------------------------------------------------------------------

Need to consider all table fields when determine index/value packing

Summary:
For scenarios like `SELECT A FROM table WHERE B > 1`, we only considered the fields in the SELECT item list but not the WHERE condition, so we end up returning garbage. This uses the readset and walks through all fields, just like what we do in setup_field_decoders. I have another pending change to make the covering check take column keys into account. Fortunately, we don't have those in prod for bypass queries yet.

Reference Patch: facebook/mysql-5.6@569f281

Reviewed By: Pushapgl

Differential Revision: D26566928

-----------------------------------------------------------------------------------

Remove field limit and support VARBINARY

Summary:
A small amount of production traffic is rejected by the bypass query engine because the table has more than 15 fields. This removes the unnecessary restriction. Also adds VARBINARY support.

Reference Patch: facebook/mysql-5.6@f1dfd5a

Reviewed By: Pushapgl

Differential Revision: D26566763

-----------------------------------------------------------------------------------

Refactoring index covering check to match MyRocks and add non-covering SK tests and rocksdb_covering_secondary_key stats

Summary:
When considering whether a secondary key is covering, we need to also consider the lookup bitmap for some special cases, like using the covering bitmap with prefix keys. This isn't much of an issue with 5.6 today because we don't actually enable the covering bitmap format in prod, but in 8.0 it might be an issue. Either way, this refactors the covering check to be exactly the same as MyRocks does it. Also adds more tests for covering / non-covering prefix keys.

Also, in order to catch cases where we incorrectly consider covering secondary keys to be non-covering, I've added rocksdb_covering_secondary_key_count support to bypass (should've done that earlier) and added checks in the secondary range key tests to ensure the covering check is done correctly.

Reference Patch: facebook/mysql-5.6@30e9039

----

Porting Notes:
* Fixed an issue that we need to explicitly compare with Rdb_key_def::KEY_COVERED
* The scenario with prefix key is now correctly handled and re-enabled in bypass_select_scenarios.inc

Reviewed By: Pushapgl

Differential Revision: D26566848

-----------------------------------------------------------------------------------

Unique SK Point lookup actually go through point look up path

Summary:
For a unique SK point lookup, today, due to the SK being an extended key, a full SK lookup will not go through the SK point lookup path and instead will go through the range lookup path. The fix is to use user_defined_key_parts instead of key_parts so that we only compare the 'real' SK columns to see if the key is 'full'. This doesn't impact correctness - only performance, since the range query path is slightly more involved.

The next step would be making SK point lookup actually use Get instead of an iterator if it also has all PK components. MyRocks doesn't do that (yet) but bypass could.

Reference Patch: facebook/mysql-5.6@d836fc9

Reviewed By: luqun

Differential Revision: D26566923

-----------------------------------------------------------------------------------

Fix incorrect bloom filter full key check

Summary:
If the SK is a non-unique key and ends up being shorter than the bloom filter, we set m_is_full_key = true if the SK is fully specified, which would enable bloom filter usage despite the key itself not actually being full (missing SK) and being shorter than the bloom filter, so it would end up not being able to find the row (thanks Manuel for finding the problem in code review!). The correct fix is to use Rdb_key_def::get_key_parts, which *always* has the full key parts (SK + PK), and not use actual_key_parts (SK+PK if non-unique key, and SK only if unique key) nor user_defined_key_parts (SK only).
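
A minimal sketch of the corrected check; only m_is_full_key and Rdb_key_def::get_key_parts() are named in the commit, the rest is assumed:

```
// Hedged sketch: get_key_parts() always counts the full SK + PK parts,
// unlike actual_key_parts (SK only for unique keys) or
// user_defined_key_parts (SK only), so the bloom filter is only used
// when the key really is full.
m_is_full_key = (n_key_parts_specified == kd.get_key_parts());
```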

Another interesting aspect is: when we loop through all the keys in SELECT in bypass, should we use actual_key_parts / user_defined_key_parts / get_key_parts()? Ideally we should use get_key_parts, which would guarantee using the most specific SK+PK and enable using the bloom filter if it ends up being shorter in the unique case. However, bypass currently uses actual_key_parts, which aligns with what MySQL passes to MyRocks. In practice this perhaps matters little, as it doesn't make much sense to have a bloom filter larger than the unique SK length.

This also fixes some bad bloom filter names in .opt files that I didn't update properly.

Reference Patch: facebook/mysql-5.6@64ba7e2

Reviewed By: luqun

Differential Revision: D27298776
ldonoso pushed a commit that referenced this pull request Apr 4, 2022
*Problem:*

ASAN complains about stack-buffer-overflow on function `mysql_heartbeat`:

```
==90890==ERROR: AddressSanitizer: stack-buffer-overflow on address 0x7fe746d06d14 at pc 0x7fe760f5b017 bp 0x7fe746d06cd0 sp 0x7fe746d06478
WRITE of size 24 at 0x7fe746d06d14 thread T16777215

Address 0x7fe746d06d14 is located in stack of thread T26 at offset 340 in frame
    #0 0x7fe746d0a55c in mysql_heartbeat(void*) /home/yura/ws/percona-server/plugin/daemon_example/daemon_example.cc:62

  This frame has 4 object(s):
    [48, 56) 'result' (line 66)
    [80, 112) '_db_stack_frame_' (line 63)
    [144, 200) 'tm_tmp' (line 67)
    [240, 340) 'buffer' (line 65) <== Memory access at offset 340 overflows this variable
HINT: this may be a false positive if your program uses some custom stack unwind mechanism, swapcontext or vfork
      (longjmp and C++ exceptions *are* supported)
Thread T26 created by T25 here:
    #0 0x7fe760f5f6d5 in __interceptor_pthread_create ../../../../src/libsanitizer/asan/asan_interceptors.cpp:216
    #1 0x557ccbbcb857 in my_thread_create /home/yura/ws/percona-server/mysys/my_thread.c:104
    #2 0x7fe746d0b21a in daemon_example_plugin_init /home/yura/ws/percona-server/plugin/daemon_example/daemon_example.cc:148
    #3 0x557ccb4c69c7 in plugin_initialize /home/yura/ws/percona-server/sql/sql_plugin.cc:1279
    #4 0x557ccb4d19cd in mysql_install_plugin /home/yura/ws/percona-server/sql/sql_plugin.cc:2279
    #5 0x557ccb4d218f in Sql_cmd_install_plugin::execute(THD*) /home/yura/ws/percona-server/sql/sql_plugin.cc:4664
    #6 0x557ccb47695e in mysql_execute_command(THD*, bool) /home/yura/ws/percona-server/sql/sql_parse.cc:5160
    percona#7 0x557ccb47977c in mysql_parse(THD*, Parser_state*, bool) /home/yura/ws/percona-server/sql/sql_parse.cc:5952
    percona#8 0x557ccb47b6c2 in dispatch_command(THD*, COM_DATA const*, enum_server_command) /home/yura/ws/percona-server/sql/sql_parse.cc:1544
    percona#9 0x557ccb47de1d in do_command(THD*) /home/yura/ws/percona-server/sql/sql_parse.cc:1065
    percona#10 0x557ccb6ac294 in handle_connection /home/yura/ws/percona-server/sql/conn_handler/connection_handler_per_thread.cc:325
    percona#11 0x557ccbbfabb0 in pfs_spawn_thread /home/yura/ws/percona-server/storage/perfschema/pfs.cc:2198
    percona#12 0x7fe760ab544f in start_thread nptl/pthread_create.c:473
```

The reason is that `my_thread_cancel` is used to finish the daemon thread. This is not an orderly way of finishing the thread. ASAN does not register that the stack variables are no longer used, which generates the error above.

This is a benign error as all the variables are on the stack.

*Solution*:

Finish the thread in an orderly way by using a signalling variable.
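
A minimal sketch of the signalling-variable pattern; the names are hypothetical, not the actual daemon_example code:

```
#include <atomic>

static std::atomic<bool> heartbeat_stop{false};

static void *mysql_heartbeat(void *) {
  while (!heartbeat_stop.load(std::memory_order_relaxed)) {
    // ... one heartbeat iteration, then sleep ...
  }
  return nullptr;  // the thread returns normally, so its stack unwinds
}

// In the plugin deinit path, instead of my_thread_cancel():
//   heartbeat_stop.store(true);
//   my_thread_join(&heartbeat_thread, nullptr);
```
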
ldonoso pushed a commit that referenced this pull request Apr 12, 2022
… enabled

Summary:
For secondaries, when enable_super_log_bin_read_only is on and read_only is on, it is currently forbidden to install/uninstall a plugin at run time.

install/uninstall plugin doesn't generate an event in the binlog, although the thread's thd contains the OPTION_BIN_LOG flag because log_slave_updates is on by default on secondaries. It should be safe to execute install/uninstall plugin.

The change is to call set_skip_readonly_check() before install/uninstall plugin and call reset_skip_readonly_check() (for completeness) after install/uninstall plugin.
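
A minimal sketch of the change; set_skip_readonly_check() and reset_skip_readonly_check() are named in the commit text, the call site shown here is an assumption:

```
// Hedged sketch: bracket the plugin DDL with the skip-readonly flag.
static bool install_plugin_skipping_readonly(THD *thd) {
  thd->set_skip_readonly_check();    // plugin DDL writes no binlog event
  const bool error = do_install_plugin(thd);  // hypothetical helper
  thd->reset_skip_readonly_check();  // for completeness; MySQL also resets
                                     // this at the start of each statement
  return error;
}
```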

BTW, MySQL will always call reset_skip_readonly_check() at the beginning of each statement, thus set_skip_readonly_check() won't affect other statements:
```
#0  THD::reset_skip_readonly_check (this=0x7fb5c1a0b000) at /home/luqun/mysql/mysql-8.0.20/sql/sql_class.h:1754
#1  0x0000000005a500e1 in THD::reset_for_next_command (this=0x7fb5c1a0b000) at /home/luqun/mysql/mysql-8.0.20/sql/sql_parse.cc:5892
#2  0x0000000005a517a5 in mysql_reset_thd_for_next_command (thd=0x7fb5c1a0b000) at /home/luqun/mysql/mysql-8.0.20/sql/sql_parse.cc:5817
#3  0x0000000005a50466 in mysql_parse (thd=0x7fb5c1a0b000, parser_state=0x7fb5f6bb4560, last_timer=0x7fb5f6bb39b0) at /home/luqun/mysql/mysql-8.0.20/sql/sql_parse.cc:6056
#4  0x0000000005a4c7c9 in dispatch_command (thd=0x7fb5c1a0b000, com_data=0x7fb5f6bb4d98, command=COM_QUERY) at /home/luqun/mysql/mysql-8.0.20/sql/sql_parse.cc:2222
#5  0x0000000005a4f991 in do_command (thd=0x7fb5c1a0b000) at /home/luqun/mysql/mysql-8.0.20/sql/sql_parse.cc:1556
#6  0x0000000005ccd4f1 in handle_connection (arg=0x7fb5cc85b740) at /home/luqun/mysql/mysql-8.0.20/sql/conn_handler/connection_handler_per_thread.cc:330
#7  0x00000000078eb95b in pfs_spawn_thread (arg=0x7fb5f8c89720) at /home/luqun/mysql/mysql-8.0.20/storage/perfschema/pfs.cc:2884
#8  0x00007fb5f957020c in start_thread (arg=0x7fb5f6bb6700) at pthread_create.c:479
#9  0x00007fb5f971881f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
```
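
A hedged sketch of the described change, with the surrounding code simplified (the call and member names are illustrative of the shape in sql/sql_plugin.cc, not the verbatim diff):

```cpp
// Simplified sketch of Sql_cmd_install_plugin::execute() with the change.
bool Sql_cmd_install_plugin::execute(THD *thd) {
  // INSTALL PLUGIN writes no binlog event, so bypassing the read_only
  // check is safe here even on a read-only secondary.
  thd->set_skip_readonly_check();
  bool st = mysql_install_plugin(thd, m_comment, &m_ident);
  // For completeness; reset_skip_readonly_check() also runs at the start
  // of the next statement in mysql_reset_thd_for_next_command().
  thd->reset_skip_readonly_check();
  if (!st) my_ok(thd);
  return st;
}
```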

Reviewed By: george-reynya

Differential Revision: D27213990
ldonoso pushed a commit that referenced this pull request Apr 12, 2022
Summary:
This is an upstream bug: https://bugs.mysql.com/bug.php?id=92964

The issue is that if a slave instance uses session_track_gtids == OWN_GTID and also enables MTS, the slave lags behind the master and its performance degrades over time.

The lag/perf degradation is caused by each slave worker adding its own gtid to its own Session_consistency_gtids_ctx.m_gtid_set during each transaction commit. Over time, a Session_consistency_gtids_ctx.m_gtid_set can accumulate millions of gno_intervals, and inserting one gtid into it then takes a long time (time complexity O(n)); see [Gtid_set::add_gno_interval](https://github.com/facebook/mysql-5.6/blob/fb-mysql-8.0.13/sql/rpl_gtid_set.cc#L323).

There are a couple of places in the code related to this issue:
- During slave worker thread start, there is a [check](https://github.com/facebook/mysql-5.6/blob/fb-mysql-8.0.13/sql/session_tracker.cc#L1560) not to enable the tracker for system threads, but that check doesn't work: the THD::system_thread field still holds its default value (NON_SYSTEM_THREAD) during THD::THD():
```
  m_enabled = thd->variables.session_track_gtids != OFF &&
              /* No need to track GTIDs for system threads. */
              thd->system_thread == NON_SYSTEM_THREAD;
```
- In [Session_gtids_track::Update](https://github.com/facebook/mysql-5.6/blob/fb-mysql-8.0.13/sql/session_tracker.cc#L1581), the else branch doesn't unregister the listener, because mysql/mysql-server@1e0844c tried to fix another issue.

Solutions:
1. [Current] Add a system-threads check while collecting data, since the THD is fully initialized at that time (see the sketch below).
   - Pro: simple change (easy to port) and perf is OK.
2. Add `thd->session_track_gtids=OFF; thd->session_tracker.get_tracker(SESSION_GTIDS_TRACKER)->update(thd);` after [init_slave_thread()](https://github.com/facebook/mysql-5.6/blob/fb-mysql-8.0.13/sql/rpl_slave.cc#L6788)
   - Pro: during init_slave_thread(), thd->system_thread is set to the correct value.
   - Con: it needs to be added in several places, such as handle_slave_io, handle_slave_sql, handle_slave_worker.
3. Add `thd->session_track_gtids=OFF;thd->session_tracker.get_tracker(SESSION_GTIDS_TRACKER)->update(thd);` inside [init_slave_thread()](https://github.com/facebook/mysql-5.6/blob/fb-mysql-8.0.13/sql/rpl_slave.cc#L6788)
   - Pro: added only once for the slave/slave-worker/IO threads.
   - Con: the [should_collect](https://github.com/facebook/mysql-5.6/blob/fb-mysql-8.0.13/sql/rpl_context.cc#L53) check still returns true, so it performs worse than #1.
4. Add `thd->session_track_gtids=OFF;thd->rpl_thd_ctx.session_gtids_ctx().unregister_ctx_change_listener(thd->session_tracker.get_tracker(SESSION_GTIDS_TRACKER));` inside init_slave_thread()
   - Pro: perf is better than #3, on par with #1.
   - Con: if other non-system threads (besides the IO/SQL threads) run transactions, those threads will still collect their own gtids.
5. Add another overloaded THD constructor that takes an 'enum_thread_type' parameter defaulting to NON_SYSTEM_THREAD.
   - Pro: it initializes the correct value with the THD and won't enable those session trackers.
   - Con: similar to #2/#4, THD() needs to be replaced with THD(enum_thread_type) in several places.
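
A sketch of solution 1, assuming the re-check lands in the collection-time predicate in sql/rpl_context.cc (function shape simplified):

```cpp
// Sketch: re-check at data-collection time, when THD::system_thread holds
// its real value. During THD::THD() it is still the default
// NON_SYSTEM_THREAD, so the tracker-enable-time check alone is not enough.
bool Session_consistency_gtids_ctx::shall_collect(const THD *thd) {
  return /* ... the existing conditions, plus: */
         thd->variables.session_track_gtids != OFF &&
         /* No need to track GTIDs for system threads. */
         thd->system_thread == NON_SYSTEM_THREAD;
}
```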

Differential Revision: D18072672
ldonoso pushed a commit that referenced this pull request Apr 12, 2022
MySQL Bypass Project Initial Change

Summary:
MySQL bypass infrastructure change

1. Adds a new hton override `handlerton::handle_single_table_select` to allow hijacking of SELECT statement execution from any storage engine.
2. Adds lex parsing for bypass hints in optimizer-hint style: /*+ bypass */ and /*+ no_bypass */.
3. MyRocks overrides handle_single_table_select but does nothing and simply falls back.

This is a port from the bypass feature branch - change #1 of 3. I'm planning to port the changes in the following order:

1. MySQL lexer + SELECT hijacking (this one).
2. SELECT Statement parsing and checking
3. Execution of SELECT statement in bypass

Reference Patch: facebook/mysql-5.6@2d19dc0

-----
Porting Notes:

* The entry points have changed due to the MySQL 8.0 refactoring of the SELECT path; the hook is now in `Sql_cmd_dml::execute_inner` (see the sketch after these notes). I'll also see whether overriding the entire execute would be better, but this should be good enough for now.
* The check for whether we should go to the bypass handler is mostly the same, except st_select_lex_t is now SELECT_LEX.
* The lex parsing for hints was mostly ported as-is. I was also debating whether I should just use optimizer hints instead of hacking the lexer, but decided against that for now so that migration from 5.6 will be easier. We can migrate to optimizer hints once the 8.0 migration is complete.
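
A sketch of the dispatch shape this describes (helper names hypothetical; the actual hook point is `Sql_cmd_dml::execute_inner`, and the real return-value contract may differ):

```cpp
// Sketch: let the storage engine hijack a single-table SELECT before the
// regular execution path runs.
bool Sql_cmd_dml::execute_inner(THD *thd) {
  handlerton *hton = single_table_hton(thd->lex);  // hypothetical helper
  if (hton != nullptr && hton->handle_single_table_select != nullptr &&
      hton->handle_single_table_select(thd, thd->lex->select_lex)) {
    return false;  // the engine handled the SELECT (bypass); we are done
  }
  // ... otherwise fall through to regular SELECT execution ...
}
```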

Differential Revision: D22807040

-----------------------------------------------------------------------------------

MySQL bypass change #2 - SELECT parsing

Summary:
This is change #2 that ports the SELECT statement parsing from feature-myrocks-bypass branch, including:

1. A select_parser class that parses SELECT statements, validates that the scenarios are supported, and errors out with a proper message if not.
2. The bypass policy can be controlled with the rocksdb_select_bypass_policy variable = 0 (ALWAYS OFF) / 1 (ALWAYS ON) / 2 (OPT IN) / 3 (OPT OUT).
3. A few metrics for the number of SELECT bypass statements executed / rejected (not supported) / failed (something went wrong in execution).

At this point this only dumps the parsed results of the SELECT statement without doing anything useful. The next change will port the query plan execution and the point/range query execution part.

Please ignore the unused attributes for now - they'll be removed in the next change once the execution part is ported.

Reference Patch:

facebook/mysql-5.6@e79b20a
facebook/mysql-5.6@8072e3d

 ---
Porting Notes:

There are a lot of small changes in this area but nothing too bad:
* Some bypass scenarios no longer fail, or fail with a different error message, because of the new parse tree -> AST conversion process. For example, `SELECT *` now works correctly.
* st_select_lex_t -> SELECT_LEX, and many members have changed names, such as where -> where_cond().
* Protocol usage has changed to start_row, store, end_row in 8.0, which is much better.
* thd->query() is now a LEX_STRING.
* The SELECT option check (such as SELECT_DISTINCT) now uses active_options() and only checks for SELECT_DISTINCT, as there seem to be default values being set, most likely as a side effect of the additional parse tree processing. I checked the flags and it looks like only SELECT_DISTINCT is interesting and needs to be blocked.
* The SELECT INTO OUTFILE check is changed to use results->needs_file_priviledge.
* True/False constants are now represented as Item_func_true/Item_func_false. For now I'm over-generalizing to functions that are const, and expect val_int()/etc. to do the right thing.

Reviewed By: luqun

Differential Revision: D22808305

-----------------------------------------------------------------------------------

MySQL Bypass Change #3 - Query Execution

Summary:
MySQL bypass execution of SELECT statements. All the query planning / execution is in the `select_exec` class.

When creating the query plan, the following two functions are the most important:

1. scan_value determines whether to unpack the index, the value, or both.
2. scan_where determines which parts of the WHERE clause are indexed and builds the key from them, in index order. The rest become filters used to include/skip values.

When it comes to unpacking, we have two code paths:

1. The unpack_fast* functions are the fast path. They unpack directly into the text buffer m_row_buf and a mapping data structure m_send_mapping that maps from SELECT item order to a (buf, len) tuple pointing into the text, later to be sent out in SELECT item order using protocol::store_string_data (I had to expose this internal function and rename it, but I'm open to suggestions. The string overload checks the type of its argument, so it asserts when you call store(string) for an int field, etc. But I'm wondering whether this actually exposes a bug where the client would see an incorrect type and our tools wouldn't catch it. TODO: validate this, but it is not necessarily blocking). Note that this path doesn't support filtering; that is to be added in the future for simple cases.

2. The slower-path unpack* functions first unpack into record[0] and use eval_cond to do filtering, with MySQL Item::val_int() evaluating the conditional expressions. When sending rows, protocol::send_result_set_row sends the results using select_lex->items pointing into record[0].

The rest should be self-explanatory.

--------
Porting Notes:

Removed unpack_fast_path, as testing showed it didn't have much impact on the performance benchmarks, and it didn't work in 8.0 anyway because we've made some format changes to VARCHAR/BLOBs. This greatly simplifies the bypass feature and makes it work much better with future format changes.

Reviewed By: luqun

Differential Revision: D22818216

-----------------------------------------------------------------------------------

Fix bypass bug in prefix decoding optimization

Summary:
This is a classic stale-pointer bug discovered in data-correctness testing; as a result, we were unpacking int fields into garbage.

In bypass we try to optimize prefix decoding for cases like `SELECT d, c, b, a FROM t WHERE a=1 and b=2` with the index being `(a, b, c, d)`. We unpack a, b and remember the location the first time, and in subsequent unpacking we can simply skip ahead to c, d. The problem is that we remember the location of a/b in m_send_mapping as a pointer into m_row_buf (for unpacking), but that buffer may reallocate, leaving stale pointers in m_send_mapping.

The fix is to use offsets rather than pointers, as the sketch below shows.
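
The pattern in minimal form (hypothetical names): store an offset into the growable buffer and resolve it to a pointer only at use time.

```cpp
#include <cstddef>
#include <string>

// Buggy shape: a raw pointer into m_row_buf goes stale when the buffer grows.
struct SendMappingBad {
  const char *field_begin;  // dangles after m_row_buf reallocates
  size_t len;
};

// Fixed shape: remember the offset; resolve against the live buffer on use.
struct SendMapping {
  size_t field_offset;
  size_t len;
};

static const char *resolve(const std::string &row_buf, const SendMapping &m) {
  return row_buf.data() + m.field_offset;  // valid across reallocations
}
```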

Also updated the test case. Unfortunately it's hard to make sure the test actually crashed before the fix (usually it just prints out the old values that are still there); I did verify manually in a debugger that it triggered the condition.

Reference Patch: facebook/mysql-5.6@e24b0bf454c

 ---
Porting Notes:

The fix no longer applies after removing the fast unpacking code; only the tests are kept.

Reviewed By: luqun

Differential Revision: D22854608

-----------------------------------------------------------------------------------

Fix bypass bug in unpack info

Summary:
The data-correctness tool found another issue with primary key unpacking: we incorrectly treated the 0x3 value as unpack info because we were doing the wrong unpack-info check.

Unpack info is detected differently for primary keys and secondary keys: for primary keys we check all the keys in setup_field_encoder, while for secondary keys we check the unpack tag (which should always be there). In bypass we were not distinguishing between primary key and secondary key unpack-info decoding (instead we always did both, which is incorrect). This fixes that and adds a test.

Also, decode_value_header is only intended for unpacking the PK and never secondary keys, because it uses m_maybe_unpack_info, which only applies to the primary key scenario. I've renamed the function to indicate explicitly that it is for the PK scenario only.

Reference Patch: facebook/mysql-5.6@f1664c144d1

 ---
Porting Notes:

Some fast-unpacking-related code is removed.

Reviewed By: luqun

Differential Revision: D22856130

-----------------------------------------------------------------------------------

Use enum for mysql_select_bypass_policy to make it more user friendly

Summary:
We should be using an enum instead of integer values for the policy: enum values are more user friendly, and you can use both the integer values and the string names. The new names are "always_on", "always_off", "opt_in", "opt_out". (I started with "off" and "force" but then realized force is a keyword.)
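
For illustration, a sketch of how such an enum sysvar is typically declared in a storage engine plugin (this is the general MySQL plugin API pattern, not the verbatim MyRocks declaration):

```cpp
// Value order matches the documented integers:
// 0 = always_off, 1 = always_on, 2 = opt_in, 3 = opt_out.
static const char *bypass_policy_names[] = {"always_off", "always_on",
                                            "opt_in", "opt_out", NullS};
static TYPELIB bypass_policy_typelib = {array_elements(bypass_policy_names) - 1,
                                        "", bypass_policy_names, nullptr};
static ulong rocksdb_select_bypass_policy = 0;  // always_off

static MYSQL_SYSVAR_ENUM(select_bypass_policy, rocksdb_select_bypass_policy,
                         PLUGIN_VAR_RQCMDARG,
                         "SELECT bypass policy: always_off, always_on, "
                         "opt_in, opt_out",
                         nullptr /* check */, nullptr /* update */,
                         0 /* default: always_off */,
                         &bypass_policy_typelib);
```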

Reference Patch: facebook/mysql-5.6@698d945e7a3

Reviewed By: luqun

Differential Revision: D22856678

-----------------------------------------------------------------------------------

Reporting rejected bypass queries into I_S

Summary: Report rejected bypass queries into information_schema instead of the error log.

Differential Revision: D24106900

-----------------------------------------------------------------------------------

Bypass rejected query normalization

Summary:
To protect clients' privacy, we need to hide detailed query information when recording a rejected bypass query. To do that, we digest/normalize each rejected bypass query before writing it into RDB_I_S. A 'digested query' is a textual representation of a query where:
  - comments are removed,
  - non-significant spaces are removed,
  - literal values are replaced with a special '?' marker,
  - lists of values are collapsed using a shorter notation.

Differential Revision: D24155757

-----------------------------------------------------------------------------------

Fix rows_read and queries_range stats in bypass

Summary:
Bypass updated rows_sent and rows_examined but didn't update rows_read. This fixes that and adds tests. We also get rocksdb_queries_point and rocksdb_queries_range for free, as we are using the MyRocks API to access data. Note that I didn't include rows_examined in the test, because querying unrelated information_schema tables affects rows_examined and makes it flaky, depending on many factors like perf schema. I also found and fixed a double-counting issue with rocksdb_queries_range.

Reference Patch: facebook/mysql-5.6@6106eba

----
Porting Notes: Removed updates to table stats, which are not ported to 8.0.

Differential Revision: D25809584

-----------------------------------------------------------------------------------

Fix bypass range query

Summary:
The range query implementation in the current bypass code is not correct. For A >= begin AND A <= end, it picks only begin *or* end as the start of the range, and uses only the prefix as the end of the range, which means it can scan far more rows than needed. It relies on the conditions A >= begin / A <= end to filter out the unnecessary rows, so the end result is correct (which is why this issue was not discovered in testing).

The correct implementation sets up the (begin, end) range slice up front and does NOT rely on any conditional evaluation to filter rows. The tricky part is determining the correct begin/end based on a few factors (see the sketch after this list):
* forward / reverse column family
* ascending / descending order
* inclusive vs. non-inclusive ranges (>=, <= vs. <, >)
* prefix query vs. range query
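
A deliberately simplified sketch of the bound computation (hypothetical helper; the real code also handles prefix queries and descending order), assuming keys are memcmp-comparable after encoding and that a reverse column family iterates from high to low:

```cpp
#include <string>
#include <utility>

struct RangeBounds {
  std::string begin;  // first key the iterator should see
  std::string end;    // first key past the range (exclusive)
};

// The bytewise successor of s is s + '\0', so inclusive and exclusive
// bounds can both be expressed as one half-open [begin, end) slice.
static RangeBounds make_bounds(std::string low, bool low_inclusive,
                               std::string high, bool high_inclusive,
                               bool reverse_cf) {
  if (!low_inclusive) low.push_back('\0');    // A >  low: skip 'low' itself
  if (high_inclusive) high.push_back('\0');   // A <= high: include 'high'
  RangeBounds b{std::move(low), std::move(high)};
  if (reverse_cf) std::swap(b.begin, b.end);  // reverse CF scans high -> low
  return b;
}
```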

For better testing, I've done the following:
* Added verify_bypass_query.inc, which runs the query in bypass and non-bypass mode and verifies that the rows read in bypass are the same as or fewer than in the non-bypass case, to ensure such an issue won't happen again. Most bypass query validation is moved to verify_bypass_query.inc, and we can add more validation there in the future as needed. As a bonus, it also makes it easy to verify that the results are the same in the bypass/non-bypass case.
* Moved range key validation from bypass_select_scenarios into bypass_select_range_pk and bypass_select_range_sk.
* Added more test cases to bypass_select_range_pk / bypass_select_range_sk, especially for the PK case, where some scenarios were missing.
* Dumped query results to a file and verified the results didn't change w/ and w/o bypass. I had to backport `output` mysqltest support to make this work properly (for some reason --exec $MYSQL doesn't like multiline queries).

For review, there is no need to go through the results file, as there is in-place validation to make sure queries return the same results w/ and w/o bypass.

Reference Patch: facebook/mysql-5.6@ef9a677

------------
Porting Notes:

* Disabled a TEXT/BLOB prefix key scenario that requires one of the later commits to fix.
* ubsan reported a "bug" with memcmp(NULL, NULL, 0) in the Rdb_string_writer comparison. The fix has been backported to 5.6.
* Tests caught a bug where THD::inc_sent_row_count now increments the status variable, resulting in double counting.

Reviewed By: Pushapgl

Differential Revision: D26564619

-----------------------------------------------------------------------------------

Properly fallback if executing unsupported case

Summary:
In most cases the bypass parser `select_parser` should identify unsupported cases and fall back to the regular query path. However, a few checks are expensive and are done as part of execution, and we should properly fall back in those cases as well. In a future version maybe we should refactor so that those become another phase of parsing, which would also help bypass RPC.

Reference Patch: facebook/mysql-5.6@5452df8

Reviewed By: luqun

Differential Revision: D26564781

-----------------------------------------------------------------------------------

Add switch to disallow filters

Summary:
For scenarios like A >= 1 AND B >= 2, doing this correctly requires skip scan, which isn't supported yet. Today we just use A >= 1 as the starting point and apply B >= 2 as a filter to discard unwanted rows, which isn't efficient. There are other cases too, such as a range query forcing index A but using a key from index B. Our production bypass workloads don't contain these scenarios, but it is good to add a switch that disallows such non-optimal cases and simply falls back to a regular range query for these corner cases.

This diff adds the `rocksdb-select-bypass-allow-filters` switch to allow/block filters. It is enabled by default to make tests happy, but will be disabled in prod.

Reference Patch: facebook/mysql-5.6@7cdb33a

Reviewed By: Pushapgl

Differential Revision: D26566608

-----------------------------------------------------------------------------------

Need to consider all table fields when determine index/value packing

Summary:
For scenarios like `SELECT A FROM table WHERE B > 1`, we only considered the fields in the SELECT item list but not the WHERE condition, so we ended up returning garbage. This change uses the read set and walks through all fields, just like we do in setup_field_decoders. I have another pending change to make the covering check take column keys into account; fortunately we don't have those in prod for bypass queries yet.

Reference Patch: facebook/mysql-5.6@569f281

Reviewed By: Pushapgl

Differential Revision: D26566928

-----------------------------------------------------------------------------------

Remove field limit and support VARBINARY

Summary:
A small amount of production traffic is rejected by the bypass query engine because the table has more than 15 fields. This removes that unnecessary restriction and also adds VARBINARY support.

Reference Patch: facebook/mysql-5.6@f1dfd5a

Reviewed By: Pushapgl

Differential Revision: D26566763

-----------------------------------------------------------------------------------

Refactoring index covering check to match MyRocks and add non-covering SK tests and rocksdb_covering_secondary_key stats

Summary:
When deciding whether a secondary key is covering, we also need to consider the lookup bitmap for some special cases, like using the covering bitmap w/ prefix keys. This isn't much of an issue in 5.6 today, because we don't actually enable the covering bitmap format in prod, but in 8.0 it might be. Either way, this refactors the covering check to be exactly the same as MyRocks', and adds more tests for covering / non-covering prefix keys.

Also, to catch cases where we incorrectly consider covering secondary keys to be non-covering, I've added rocksdb_covering_secondary_key_count support to bypass (should've done that earlier) and added checks in the secondary range key tests to ensure the covering check is done correctly.

Reference Patch: facebook/mysql-5.6@30e9039

----

Porting Notes:
* Fixed an issue that we need to explicitly compare with Rdb_key_def::KEY_COVERED
* The scenario with prefix key is now correctly handled and re-enabled in bypass_select_scenarios.inc

Reviewed By: Pushapgl

Differential Revision: D26566848

-----------------------------------------------------------------------------------

Unique SK Point lookup actually go through point look up path

Summary:
For a unique SK point lookup today, because the SK is an extended key, a full SK lookup will not go through the SK point lookup path and instead goes through range lookup. The fix is to use user_defined_key_parts instead of key_parts, so that we only compare the 'real' SK columns when deciding whether the key is 'full'. This doesn't impact correctness, only performance, since the range query path is slightly more involved.

The next step would be making the SK point lookup actually use Get instead of an iterator when it also has all PK components. MyRocks doesn't do that (yet), but bypass could.

Reference Patch: facebook/mysql-5.6@d836fc9

Reviewed By: luqun

Differential Revision: D26566923

-----------------------------------------------------------------------------------

Fix incorrect bloom filter full key check

Summary:
If the SK is a non-unique key and ends up being shorter than the bloom filter prefix, we set m_is_full_key = true when the SK is fully specified, which enables bloom filter usage, even though the key itself isn't actually full (the PK suffix is missing) and is shorter than the bloom filter prefix, so we would end up unable to find the row (thanks Manuel for finding the problem in code review!). The correct fix is to use Rdb_key_def::get_key_parts, which *always* has the full key parts (SK + PK), and not actual_key_parts (SK+PK for a non-unique key, SK only for a unique key) nor user_defined_key_parts (SK only).

Another interesting aspect: when we loop through all the keys of a SELECT in bypass, should we use actual_key_parts / user_defined_key_parts / get_key_parts()? Ideally we should use get_key_parts, which would guarantee using the most specific SK+PK and enable the bloom filter if the key ends up shorter in the unique case. Currently, however, bypass uses actual_key_parts, which aligns with what MySQL passes to MyRocks. In practice this probably matters little, as it doesn't make much sense for the bloom filter prefix to be larger than the unique SK length.

This also fixes some bad bloom filter names in .opt files that I didn't update properly.

Reference Patch: facebook/mysql-5.6@64ba7e2

Reviewed By: luqun

Differential Revision: D27298776

-----------------------------------------------------------------------------------

Introduce base select parser class

Summary:
To make it easier to add a new bypass RPC parser to the existing bypass implementation (nosql_access.cc), and to keep changes to the bypass engine minimal while doing so, I introduced a new base class, base_select_parser. It includes all essential members and methods for parsing bypass SQL/RPC queries. The existing bypass SQL parser (select_parser) and a new bypass RPC parser (not included in this diff) inherit from it.

Bypass RPC reference diff: D30493654

Reviewed By: yizhang82

Differential Revision: D31870628
ldonoso pushed a commit that referenced this pull request Apr 12, 2022
point relay logs to binlog

Summary:
When a raft ring goes through a leader change, log files (and caches) need to be updated correctly. In particular, relay logs need to be set up as raft logs (i.e. converted from the relay log naming convention to the binary log naming convention), and the file offsets need to be set correctly for the SQL appliers to start reading from later. This is also true when a node starts up for the first time. This diff brings the 5.6 support to 8.0.

Note that in 8.0, the interfaces to the binlog and the interaction with SQL appliers have changed significantly, hence there are multiple changes compared to 5.6. A few notable ones:
1. rli->init_relay_log_pos() is called in 5.6 to set up the relay log cursor, close cur_log/cur_log_fd/hot_log, and reconcile the SQL appliers' read position (through cur_log, cur_log_fd, etc.). This function does not exist in 8.0. My understanding so far is that one needs to set group_relay_log_name and group_relay_log_pos; the SQL appliers then initialize their reader state (through Rpl_applier_reader) based on these. hot_log/cur_log do not exist in 8.0. The main relay log is already closed in relay_log_raft_reset() and we do not have to do anything special. (Of course, all of this will be tested later during integration with the plugin.)
2. In 5.6, the relay log's IO_CACHE is opened with the SEQ_READ_APPEND flag, because the IO thread and SQL threads might share the hot_log/cur_log, with the IO thread appending (writing) and the SQL thread reading. This mechanism does not exist in 8.0: SQL appliers use a different mechanism to read the relay log, so MYSQL_BIN_LOG always opens the underlying IO_CACHE in WRITE mode, and there is no need to pass the io-cache type to functions that open a binlog/relay-log file.
3. rli does not maintain linfo (the position in the index file that the SQL applier needs to read next), hence the change around this is dropped.

Differential Revision: D22553192

--------------------------------------------------------------------------

port purge_logs hook and safe purge log support from 5.6

Summary: see title

Reviewed By: luqun

Differential Revision: D23887810

Port : Implementing SHOW RAFT STATUS COMMAND

Summary: see title.

Differential Revision: D23899017

--------------------------------------------------------------------------

use correct relay log IO_CACHE pos

Summary: The relay log's last position is recorded in atomic_binlog_end_pos; pass this value to the raft plugin for appending trx events.

Differential Revision: D24434583

--------------------------------------------------------------------------

Add RaftListenerCallbackType::GET_EXECUTED_GTIDS

Summary:
In MySQL 8, the global variable gtid_state contains all GTID-related information.
Use gtid_state's get_executed_gtids() to fetch the executed gtids for the raft plugin.

Differential Revision: D24430809

--------------------------------------------------------------------------

Add change master and reset slave support

Summary:
Add change master and reset slave raft events.
Add logic to set the read-only extra message during change master and reset slave.

Differential Revision: D24164576

--------------------------------------------------------------------------

use new IO_CACHE to pack all binlog events

Summary: In MySQL 8.0, m_cache during binlog_cache_data::flush() doesn't contain the gtid event, but raft requires a single IO_CACHE containing all events (gtid event, HLC event, transaction events), so create a new IO_CACHE to pack all of these events together.

Differential Revision: D24431830

--------------------------------------------------------------------------

Disable Create SLAVE IO thread and disable purge relay log by worker thread

Summary:
When the raft plugin is enabled, don't create the slave IO thread and don't purge relay logs from the worker thread.
The raft plugin pumps transaction messages to the slave worker threads and also purges relay logs.

Differential Revision: D24512006

--------------------------------------------------------------------------

Add rpl_raft.rpl_raft_basic testcase

Summary:
Add the rpl_raft.rpl_raft_basic testcase.

Changes (compared with 5.6):
 In 8.0, specifying both --replicate-same-server-id and --log-slave-updates generates a warning; use mtr.call to ignore it.
 In 8.0, rpl_init.inc exposes $_rpl_master as a variable for the primary.
 In 8.0, several variables moved from information_schema to performance_schema, such as global_status, etc.

Differential Revision: D23966751

--------------------------------------------------------------------------

Add raft_cur_log_ext field in MYSQL_BIN_LOG

Summary:
The raft plugin tracks the cur_log_ext value when rotating logs.

In MySQL 5.6, cur_log_ext is a field in MYSQL_LOG.
In MySQL 8.0, the cur_log_ext field was removed in favor of using the max number in the log directory.

Since cur_log_ext is used during ordered_commit, adding the field back is the easier way to minimize perf impact.

Differential Revision: D24531649

--------------------------------------------------------------------------

auto deletion of apply logs based on sysvars

Summary:
Ported D22095395 (facebook/mysql-5.6@d7dbc9d).
**Related Summary**
Apply logs are the new trx log on raft followers. We should be aggressive in
reclaiming space consumed by apply logs. This diff adds the support for auto
deletion of apply logs.

Approach:
1. Two new sysvars are added: apply_log_retention_num (default: 10 files) and apply_log_retention_duration (default: 15 mins). These impose (soft) limits on the minimum number of apply logs that must be retained (in number of files) and the minimum retention duration (in minutes since last modified).

2. Auto deletion is triggered whenever an apply log gets rotated after the commit of a trx (in ordered_commit) as part of the max_binlog_size check. Auto deletion triggers only if there are more than apply_log_retention_num files; files last modified more than apply_log_retention_duration minutes in the past are then auto-deleted (provided they are not actively in use). See the sketch after this list.

3. To keep things simple, the new limits are only soft limits. Pathological cases, like a user triggering multiple 'flush logs' while the system sees no trx activity for a prolonged period, are not handled.
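
A sketch of the retention test from items 1-2 (helper shape hypothetical; the sysvar names are from the summary):

```cpp
#include <cstddef>
#include <ctime>

static unsigned long apply_log_retention_num = 10;       // min files to keep
static unsigned long apply_log_retention_duration = 15;  // minutes since mtime

// Sketch: an apply log may be auto-deleted only if BOTH soft limits allow
// it; the caller additionally checks the file is not actively in use.
static bool may_auto_delete(size_t apply_file_count, time_t last_modified,
                            time_t now) {
  if (apply_file_count <= apply_log_retention_num) return false;
  return (now - last_modified) >
         static_cast<time_t>(apply_log_retention_duration * 60);
}
```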

**Notes**
There was one usage in MYSQL_BIN_LOG::init_index_file() where we modified apply_file_count earlier. I didn't find the same code in 8.0. Let me know if I am missing something.

Reviewed By: luqun

Differential Revision: D24461863

--------------------------------------------------------------------------

Port new commands to manage raft logs

Summary: Related diff D22211549 (facebook/mysql-5.6@ce0a8f7) and D23668807 (facebook/mysql-5.6@86919ca)

Reviewed By: luqun

Differential Revision: D24492697

--------------------------------------------------------------------------

Port: Consolidating all Raft Rotation Info in one struct

Summary:
Ported D22567487 (facebook/mysql-5.6@1142e3c) and D23634741 (facebook/mysql-5.6@bdf8f33).
**Summary**
* In Raft, when HLC is enabled, we want binlogs and relay logs (both of which are raft logs) to have the same size and to match binarily from the end of the previous HLC event, or the start of the transactions. This ensures that the end log pos is consistent, which is required so that the relay-log.info file has the proper position of applied events. To achieve this, we have to make sure the top part of the binlog file is identical (since this part is not replicated). The top part contains the format descriptor, previous gtid, and previous HLC events. In this diff, we make sure a previous HLC event is also put in raft relay logs.
* This is for easy extensibility, as rotate handling has seen churn with both the server and BLS. The obvious issue with flattened raft params is that the function signatures keep growing larger. Instead, we now have one struct and use a default param of nullptr to convey non-raft mode.

Reviewed By: luqun

Differential Revision: D24628485

--------------------------------------------------------------------------

Implement logic to set binlog sync variable

Summary: The Raft plugin will need to tell the server to set this variable when it changes from leader to follower and vice-versa.

Differential Revision: D24636783

--------------------------------------------------------------------------

Port error handling for file rotation flow

Summary:
Port D23398966 (facebook/mysql-5.6@d780a74)
**Summary**
If rotation of the binlog fails, the server either aborts or disables binlogging (based on binlog_error_action = ABORT_SERVER or IGNORE_ERROR). The rationale is presumably that file rotation can fail only due to system issues (out of disk space, file system errors, etc.).
However, crashing the server is not desirable when raft is enabled. File rotation is also a consensus operation and can fail due to consensus errors (not being able to replicate to a majority). Hence, if there is a consensus error during file rotation, the server should not crash or disable binlogging. This diff ensures that when raft is enabled (enable_raft_plugin) and the raft hooks return an error, that error does not crash the server. A new error (ER_RAFT_FILE_ROTATION_FAILED) is added and returned back to the client.

Reviewed By: luqun

Differential Revision: D24677031

--------------------------------------------------------------------------

Disallow disruptive command RESET SLAVE in raft mode

Summary:
Port D22623730 (facebook/mysql-5.6@67b33d3)
**Summary**
If RESET SLAVE is executed under raft, the binlog file indexes (suffixes) are reset and all binlog files are purged. This messes with the log index on the kuduraft side. We recently observed in production that shadow-testing scripts were calling RESET SLAVE, disrupting raft rings and causing KEY NOT FOUND errors in replication apply. Now we disallow RESET SLAVE and return an error on the SQL connection whenever enable_raft_plugin is turned on.
To also comply with the current MTR flow, which sets enable_raft_plugin in the CNF and also needs RESET SLAVE, a backdoor is required for the hacky MTR flow: raft_mtr_mode was added to bypass the above check.

Reviewed By: luqun

Differential Revision: D24678918

--------------------------------------------------------------------------

add rpl_raft.raft_io_thread raft MTR

Summary: enable_raft_plugin should not allow the slave IO thread and vice versa. Also, enable_raft_plugin should not coexist with binlog_error_action = ABORT_SERVER.

Differential Revision: D24633339

--------------------------------------------------------------------------

always exit raft queue thread

Summary:
In the RaftListenerQueue destructor, check whether the raft queue thread exists before exiting it;
also, when exiting the raft queue thread, wait until the exit event has finished.

Differential Revision: D24684936

--------------------------------------------------------------------------

New variable to disable raft log repointing

Summary:
New variable `disable_raft_log_repointing` to disable log repointing. Currently, this is used to keep logging intact in vanilla mtr suites.

Other changes:
Support for filtering out the metadata event, to keep the results of vanilla MTR tests the same in raft mode.

Differential Revision: D24531659

--------------------------------------------------------------------------

Port : Adding previous_raft_opid in previous HLC metadata event

Summary:
Port D22500383
**Summary**
In order not to have to traverse the previous file to find the last opid, we start carrying over the last opid of the previous file into the new file. This opid will be the opid of the rotate event for binlogs.
For relay logs/follower logs, the rotation opid is passed as rotate_opid to the rotation function.
For binlogs, we have the opid from the before_commit phase.
Reviewed By: luqun

Differential Revision: D24677347

--------------------------------------------------------------------------

ability to remove gtid from logged gtids

Summary:
Port D21375044 (facebook/mysql-5.6@7fc9eaa) and D17291883 (facebook/mysql-5.6@6a6a35f)
**Summary**
* As part of leadership changes, we trim binlogs/relay logs on a running mysql instance. This needs the ability to remove gtids from the executed_gtid set, which this diff adds. One can enqueue an event in the raft listener queue, and the raft thread in the server takes care of removing the gtids and updating executed_gtid on a running instance.
* When raft logs are trimmed as part of TruncateOpsAfter(), the gtids of the trimmed trxs are also removed from the executed/logged gtids. However, when binlog files are rotated, the previous gtid written to the new file depends on which files are rotated: during binlog rotation the prev gtids are the logged/executed gtids of the instance, while during relay log rotation the prev gtids are the retrieved gtids of the instance. Hence, in addition to trimming the logged/executed gtids, we should also trim the retrieved gtids. This prevents the creation of holes in the instance's executed_gtid set as the replicaset goes through multiple promotions over its lifetime.

Reviewed By: luqun

Differential Revision: D24690007

--------------------------------------------------------------------------

clear and set additional fields during promotion

Summary:
Multiple fixes to correctly maintain mysqld state during promotion/demotion.
1. is_slave: This flag is used to check whether the instance is in a mysql 'slave' role. Among other things, unsetting read-only relies on this flag (unsetting read-only fails on slave instances). It was not being cleared on promotion to master, causing issues when automation tried to unset read-only after disabling raft.
2. Flush active_mi to make it durable during promotion/demotion.
Differential Revision: D24616131

Port: reinitializing previous_gtid_set_map in binlog_change_to_apply

Summary:
Ported D23801889
**Summary**
A recent crash was discovered in SHOW BINARY LOGS WITH GTID right after a raft bootstrap. The crash happened because previous_gtid_set_map is not reinitialized properly during binlog_change_to_apply; this was an oversight. binlog_change_to_apply is unique in the repointing because the apply log's previous gtid set is initialized with the gtid_executed of the previous primary or from the previous applier; it is not initialized from files. When raft is bootstrapped for the first time, there can be an old apply-3306.index file on disk and several apply-3306.# files. If an instance then bootstraps as FOLLOWER, it executes binlog_change_to_apply with an old previous_gtid_set_map containing mappings from binary-3306.# files to their GTID sets. Now if we do SHOW BINARY LOGS WITH GTID, a predictable crash happens.
As part of this fix, we clean up/purge apply-3306.index except for the last active file. After this, a trimmed version of init_gtid_sets is called, which sets previous_gtid_set_map properly.

Reviewed By: luqun

Differential Revision: D24631274

--------------------------------------------------------------------------

Adding support for config changes string rotate event

Summary:
Details:
- Log event support for a generic raft-provided string in the metadata event.
- Do a rotate and write the config string to the metadata event.

Differential Revision: D24669995

--------------------------------------------------------------------------

Add Format_description_log_event into master_info

Summary:
This code path is similar to the SQL IO thread's queue_event.
In MySQL 8.0, the IO thread needs to add a Format_description_log_event():
https://github.com/facebook/mysql-5.6/blob/fb-mysql-8.0.17/sql/rpl_slave.cc#L2781

Differential Revision: D24434824

--------------------------------------------------------------------------

Use DBUG_TRACE instead of DBUG_ENTER

Summary:
Several testcases failed in ASAN debug builds:

==106552==ERROR: AddressSanitizer: stack-buffer-underflow on address 0x7f38f3590078 at pc 0x0000093aa495 bp 0x7f38f358ffc0 sp 0x7f38f358ffb8
READ of size 4 at 0x7f38f3590078 thread T203
    #0 0x93aa494 in _db_enter_(char const*, int, char const*, unsigned int, _db_stack_frame_*)
    #1 0x8d0b9bb in Relay_log_info::remove_logged_gtids
    #2 0x8b1ec3f in trim_logged_gtid
    #3 0x8c767cf in process_raft_queue

If you use DBUG_ENTER, you need to use DBUG_RETURN to pop the current frame from the CODE_STATE.
If you use DBUG_TRACE, it pops the current frame in its destructor (see the example below).

P.S. Small refactoring changes to sql/rpl_handler.cc.
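
The difference, in a minimal example (DBUG_ENTER/DBUG_RETURN and DBUG_TRACE are the real dbug macros; the functions are illustrative):

```cpp
#include "my_dbug.h"

// 5.6 style: every exit path must go through DBUG_RETURN, or the frame
// pushed by DBUG_ENTER is never popped and the CODE_STATE stack corrupts.
static int f_old(int x) {
  DBUG_ENTER("f_old");
  if (x < 0) DBUG_RETURN(-1);  // a bare 'return -1;' here would leak a frame
  DBUG_RETURN(x);
}

// 8.0 style: DBUG_TRACE is an RAII object whose destructor pops the frame,
// so a plain 'return' is safe on every path.
static int f_new(int x) {
  DBUG_TRACE;
  if (x < 0) return -1;
  return x;
}
```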

Reviewed By: bhatvinay

Differential Revision: D28809418
ldonoso pushed a commit that referenced this pull request Apr 12, 2022
recover raft logs by removing partial trxs

Summary:
Port D24628821

mysqld removes partial trxs in the tail of trx log (named binary-logs on
primaries and apply-logs on secondaries) during startup. However, relay logs
were not of much importance since it was anyways discarded and a new one would
be created.
However, with raft, this is not ideal. Relay logs are raft logs on secondaries
and have to be kept around (and kept sane and consistent). This diff adds the
ability to remove partial trxs from raft/relay logs.
Much of the code to open the last relay log (based on relay log index) and
identify partial trxs is borrowed from existing logic in
MYSQL_BIN_LOG::open_binlog() and binlog_recover()

Reviewed By: Pushapgl

Differential Revision: D26447448

---------------------------------------------------------------------

Checking for inited bool to make sure global_init_info was successful

Summary:
Port D25584004

A master.info and relay.info file can be present but needs to be properly inited for use. We were bypassing the inited check, which could lead to issues in Raft.
If there is an error in global_init_info, Raft does a raft_reset_slave and makes another attempt. If both recourses fail, the init of the plugin fails.

Reviewed By: Pushapgl

Differential Revision: D26447457

---------------------------------------------------------------------

Support for dumping raft logs to vanilla async replicas

Summary:
[Porting Notes]
We want to dump raft logs to vanilla async replicas regardless of whether the source is the relay log or the binlog. Effectively, after this change we dump relay logs on the followers and binlogs on the leader; when the raft role changes, the logs to be dumped are switched as well.
A Dump_log class is introduced as a thin wrapper/container around mysql_bin_log or rli->relay_log and is inited with mysql_bin_log to emulate vanilla mysql behavior. Dump threads use the global dump_log object instead of mysql_bin_log directly. We switch the log inside Dump_log only when the raft role changes (in binlog_change_to_binlog() and binlog_change_to_apply_log()). During a raft role change we take all log-related locks (LOCK_log, LOCK_index, LOCK_binlog_end_pos, and the dump log lock) to serialize it with other log operations like dumping logs. A sketch of the wrapper's shape follows the related-doc link below.
Related doc - https://fb.quip.com/oTVAAdgEi4zY
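
A sketch of the wrapper's shape (simplified from the description above; method names illustrative, mutex initialization elided):

```cpp
// Sketch: a thin container that dump threads use instead of touching
// mysql_bin_log directly. In raft mode the wrapped log is switched between
// the binlog (leader) and the relay log (follower) during role changes.
class Dump_log {
 public:
  Dump_log() : m_log(&mysql_bin_log) {}  // vanilla behavior by default

  // Called from binlog_change_to_binlog()/binlog_change_to_apply_log()
  // with LOCK_log, LOCK_index, LOCK_binlog_end_pos and the dump-log lock
  // held, so switching is serialized against the dump threads.
  void switch_log(MYSQL_BIN_LOG *new_log) {
    mysql_mutex_lock(&m_lock);
    m_log = new_log;
    mysql_mutex_unlock(&m_lock);
  }

  MYSQL_BIN_LOG *get_log() {
    mysql_mutex_lock(&m_lock);
    MYSQL_BIN_LOG *log = m_log;
    mysql_mutex_unlock(&m_lock);
    return log;
  }

 private:
  mysql_mutex_t m_lock;  // initialization elided in this sketch
  MYSQL_BIN_LOG *m_log;
};
```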

This diff contains the 7 patches below:
D23013977
D24766787
D24716539
D24900223
D24955284
D25174166
D25775525

Reviewed By: luqun

Differential Revision: D26141496

---------------------------------------------------------------------

Fixed the flaky raft test suite

Summary:
First clean run of the entire raft test suite :)

**Changes**
* Reset apply_logs on raft secondaries before the start of every test
* Increased the lock timeouts and changed isolation level in rpl_raft_slave_out_of_order_commit.

Reviewed By: luqun

Differential Revision: D26651257

---------------------------------------------------------------------

return error from Raft_replication_delegate when plugin is not available

Summary:
Port D23065441 (facebook/mysql-5.6@b9067f7)

The new macro is used to call into the raft plugin. If the plugin gets unloaded accidentally while enable_raft_plugin is ON, this STRICT version returns failure. Currently this is to be called only by the raft plugin.

Reviewed By: Pushapgl

Differential Revision: D26447523

---------------------------------------------------------------------

Adding timestamps for raft rotates which happen in the context of listener thread

Summary:
Port D25572614

The timestamp of a binlog event is picked up from the when field in the event. In most cases of rotation, when is left unpopulated for the top 3 events (fd, pgtid, metadata). Even so, a normal rotate (flush binary logs) still manages to get a valid timestamp, since the thread in which the flush binary logs happens has a valid start time.
Now enter raft relay log rotations. In those cases, and in the case of a config change rotate, the rotations happen in the context of a raft listener queue thread. In that context, the when and start time of the thread are both 0. This diff handles that case by populating the when field appropriately.

Reviewed By: bhatvinay

Differential Revision: D26194612

---------------------------------------------------------------------

Raft abrupt stepdown and trim binlog file / gtid test

Summary: The binlog file should get trimmed on an abrupt stepdown.

Reviewed By: Pushapgl, bhatvinay

Differential Revision: D26169975

---------------------------------------------------------------------

Port: Fixes around raft log truncation.

Summary:
**Notes**
* New functions to block and unblock dump threads for plugin to use
during raft log truncation.

The check below is already done in the raft plugin as part of D26866429:
* Re-init rli->cur_log in Relay_log_info::rli_init_info() instead of just seeking to the beginning, to handle the case where the raft log is truncated before the applier starts.

Reviewed By: luqun

Differential Revision: D26759813

---------------------------------------------------------------------

rebase due to relay log Format_description_event event size difference

Summary:
WL#3549 (Binlog Compression) adds 1 extra byte of data to Format_description_event:

  3298 @@ -134,6 +137,7 @@ Format_description_event::Format_description_event(uint8_t binlog_ver,
  3299            VIEW_CHANGE_HEADER_LEN,
  3300            XA_PREPARE_HEADER_LEN,
  3301            ROWS_HEADER_LEN_V2,
  3302 +          TRANSACTION_PAYLOAD_EVENT,
  3303        };

Reviewed By: Pushapgl

Differential Revision: D27145095

---------------------------------------------------------------------

fix raft change string metadata event

Summary:
Saw a bunch of stage-1 replicaset instances fail due to a config change string failure during add/remove instance:
```
I0401 00:16:10.894434 1823842 IoCacheUtils.cpp:333] getConfigChangeString eventType = ^G
E0401 00:16:10.894451 1823842 IoCacheUtils.cpp:363] Failed to read metadata event body from cache
E0401 00:16:10.894456 1823842 MysqlRaft.cpp:659] [replicate] Failed to get config change string from iocache
2021-04-01T00:16:10.894464-07:00 8307 [ERROR] [MY-010207] [Repl] Run function 'before_flush' in plugin 'RPL_RAFT' failed
2021-04-01T00:16:10.894478-07:00 8307 [ERROR] [MY-000000] [Server] Failed to rotate binary log
```

After some investigation: the metadata event length was calculated with the config change string included, but we forgot to write the config change string into the event body.

Reviewed By: Pushapgl

Differential Revision: D27504157

---------------------------------------------------------------------

Tells raft mode in binlog

Summary:
Win-win: tell whether this binlog was generated while the host was in raft mode.

We already have the bit in the FD event; this diff just surfaces it.

Reviewed By: mpercy

Differential Revision: D28664300

---------------------------------------------------------------------

fix flaky raft testcase

Summary:
There are multiple issues in MTR:
1. In sql/rpl_binlog_sender.cc, if a secondary's IO thread receives a fatal error, it quits instead of reconnecting. Use an unknown error instead, so that the secondary IO thread can try to reconnect.
2. mtr.add_suppression() executes an INSERT statement into the mtr.test_suppressions table, which doesn't have a primary key; during idempotent recovery, a secondary therefore fails to execute mtr.add_suppression() due to the missing PK. Move all mtr.add_suppression() calls to the end of the testcase to work around the idempotent recovery failure.
3. On promotion, use raft_promote_to_leader.inc instead of `set rpl_raft_new_leader_uuid`, since raft_promote_to_leader.inc waits until the new primary becomes writeable.
4. Pass a specific warning instead of '.*' to mtr.add_suppression.
5. Etc.

Reviewed By: Pushapgl, bhatvinay

Differential Revision: D28774820

---------------------------------------------------------------------

Using locks in Dump_log methods only when raft is enabled

Summary:
We don't need to take locks in non-raft mode, since the underlying MYSQL_BIN_LOG object never changes. To handle the race between updating the enable_raft_plugin var and using its value in Dump_log, we kill and block all dump threads while updating the var and unblock them once it is updated.

Reviewed By: bhatvinay

Differential Revision: D28905522

---------------------------------------------------------------------

leader election to be a sync point in the new leader

Summary:
Port of D27582002 (facebook/mysql-5.6@39c70ca) from 5.6.35 to 8.0

A newly elected raft leader makes sure that all trxs from the previous leader are committed by the SQL appliers. It then switches the server's trx logs from apply-log-* to binary-log-*. To other parts of the system this looks like a rotation, but the necessary sync calls are not made here. So if the server (or OS) restarts, the storage engine could lose the commit markers of the last batch of trxs, resulting in silent data drift.
This diff fixes the problem by making an explicit call to ha_flush_logs() before switching the server's trx logs.

Reviewed By: luqun

Differential Revision: D28880407

---------------------------------------------------------------------

Error out SQL thread on out of order opids in raft logs

Summary:
OpIds should always be in order in the raft logs. Added a check in the SQL threads that throws an error and stops the applier when out-of-order opids are detected.

Reviewed By: li-chi

Differential Revision: D28810840

---------------------------------------------------------------------

add GET_COMMITTED_GTIDS for raft

Summary:
During recovery, raft Log::Init needs to check with the server which gtids have been committed. Before doing that, it finds the entire set of trxs in the last raft log. The difference (logged - committed) gives the pending opids.

Reviewed By: Pushapgl

Differential Revision: D29622786

---------------------------------------------------------------------

raft: skip mts_recovery_groups during start slave

Summary:
During the MySQL8+Raft DMP, some instances fail to switch to leader or to start the slave:

```
2021-06-24T17:56:38.627423-07:00 431 [Note] [MY-010574] [Repl] Slave: MTS group recovery relay log info group_master_log_name /data/mysql/3127/bls-unittestdb658.frc2-3305-mysql.replicaset.180021/binary-logs-3727.000033, event_master_log_pos 1129.
2021-06-24T17:56:38.627473-07:00 431 [ERROR] [MY-010575] [Repl] Error looking for file after /binlogs/binary-logs-3307.000120.
2021-06-24T17:56:38.627516-07:00 431 [ERROR] [MY-000000] [Repl] load_mi_and_rli_from_repositories: rli_init_info returned error
```

Similar to 5.6, we don't need to run mts_recovery_groups, because GTID_MODE is always enabled.

Reviewed By: Pushapgl

Differential Revision: D29520066

---------------------------------------------------------------------

update RaftListenerCallbackArg struct

Summary: Due to a contract change between the raft plugin and mysql, update the RaftListenerCallbackArg struct to add a master_uuid field.

Reviewed By: Pushapgl, bhatvinay

Differential Revision: D29981349

---------------------------------------------------------------------

always check mi during raft_reset_slave

Summary: Add a nullptr check for mi in raft_reset_slave, similar to the one in raft_stop_sql_thread.

Reviewed By: bhatvinay

Differential Revision: D29967915

---------------------------------------------------------------------

raft mtr: update rpl_end_raft.inc for disable-raft

Summary:
rpl_end_raft.inc unregisters the raft plugin, which calls reset slave when online disable-raft is enabled. Move rpl_end_raft.inc after rpl_stop_slaves.inc so the slaves are stopped correctly first.

After stopping the slaves, mtr.add_suppression("") calls won't replicate to them, so call mtr.add_suppression("") on all instances (replicas).

Reviewed By: yizhang82

Differential Revision: D30062236

---------------------------------------------------------------------

fix master_uuid

Summary: fix master_uuid in raft mode.

Reviewed By: luqun

Differential Revision: D30261465

---------------------------------------------------------------------

incorrect File_size value in show raft logs result

Summary:
In the SHOW RAFT LOGS result, the File_size of the latest log file isn't updated correctly.

For example, SHOW RAFT LOGS returns:

```
+-------------------------+------------
| Log_name                | File_size  |
| binary-logs-3304.000670 |       1529 |
```

in fact
```
-rw-r----- 1 mysql backup 1669180487 Aug  4 14:49 binary-logs-3304.000670
-rw-r----- 1 mysql backup 1723994154 Aug  4 14:49 binary-logs-3304.000670
```
The file_size from SHOW RAFT LOGS is always 1529, but the real file size is 1669180487.

The issue is that when writing to the IO_CACHE directly, its wrapper ostream's position isn't updated.

Reviewed By: yizhang82

Differential Revision: D30116345

---------------------------------------------------------------------

Add gtid_purged_for_tailing

Summary:
Add a new global variable, gtid_purged_for_tailing.

It shows the purged GTID set for the binlog in non-raft mode, and for the raft log in raft mode.

Reviewed By: abhinav04sharma

Differential Revision: D29372776

---------------------------------------------------------------------

disable skip_setup_replica_if_unnecessary

Summary: After syncing with the latest official raft plugin, most MTRs failed due to the skip_setup_replica_if_unnecessary optimization.

Reviewed By: yizhang82

Differential Revision: D30821648

---------------------------------------------------------------------

latency histograms for raft trx wait

Summary: Port of the support added for the same in mysql 5.6. Should help monitor latencies in 8.0 + raft stage-1 replicasets.

Reviewed By: luqun

Differential Revision: D31064764

---------------------------------------------------------------------

handle cases where clean shutdown in raft aborts trxs

Summary: This is a port of the feature in 5.6.

Reviewed By: anirbanr-fb

Differential Revision: D31070593

---------------------------------------------------------------------

fix crash during binlog purging

Summary:
Any error returned by the plugin during binlog purging results in a crash in mysql8, as the server tries to execute binlog_error_action_abort. We need to differentiate explicitly between a plugin error and other errors (such as errors related to disk IO, etc.). In this particular case, the crash comes from trying to purge a file that does not exist (i.e. one that was already purged previously): raft cannot find it in its index chunk, so it returns an error.

Reviewed By: anirbanr-fb

Differential Revision: D31149997

---------------------------------------------------------------------

update flaky rpl_raft_purged_gtids_dump_threads

Summary:
The rpl_raft_purged_gtids_dump_threads MTR failed with "Cannot replicate because the master purged required binary logs" after server4 starts tailing server2.

Try to sync server4 before switching it to tail server2.

Reviewed By: bhatvinay

Differential Revision: D30818078

---------------------------------------------------------------------

fix various porting issues in mysql8 (raft) crash recovery

Summary:
1. Trimming of the binlog is left to the raft plugin (based on the current leader's log). The server should skip this step as part of recovery. This essentially means setting 'valid_pos' to the last successfully parsed trx in the trx log (instead of the engine's view of the trx log) in MYSQL_BIN_LOG::recover().
2. executed_gtid_set should be initialized based on the engine's view of the trx log file coordinates. So during startup the appropriate flags need to be passed into MYSQL_BIN_LOG::init_gtid_sets(). init_gtid_sets() had already been changed to handle this, but the flag was not set correctly during server startup.
3. Another fix is in MYSQL_BIN_LOG::init_gtid_sets() to correctly set the position for reading and calculating the executed-gtid-set (based on the file name read from the engine).

Reviewed By: anirbanr-fb

Differential Revision: D31284902

---------------------------------------------------------------------

show_raft_status should take LOCK_status

Summary:
This is a straight port of a 5.6 patch to 8.0.

SHOW RAFT STATUS and SHOW GLOBAL STATUS go to the
same functions in the plugin and access shared data structures.
These functions return internal char * pointers to plugin global
variables. They need to be serialized with LOCK_status; otherwise
this leads to race conditions.
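
A minimal sketch of the serialization being ported, with std::mutex standing in for the server's LOCK_status and an illustrative plugin global: both status commands copy the shared string while holding the lock rather than handing out raw internal pointers.

```
#include <mutex>
#include <string>

std::mutex lock_status;              // stand-in for LOCK_status
std::string raft_role = "FOLLOWER";  // stand-in for a plugin global

// Both SHOW RAFT STATUS and SHOW GLOBAL STATUS would funnel through a
// helper like this, so concurrent readers never race on the same buffer.
std::string read_raft_status_value() {
  std::lock_guard<std::mutex> guard(lock_status);
  return raft_role;  // copied out under the lock
}
```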

Reviewed By: li-chi

Differential Revision: D31318340

---------------------------------------------------------------------

Disallowing reset master on raft replicasets

Summary:
This is a straight port of the similar patch on 5.6 raft.

RESET MASTER is inherently not raft compliant.
It gets rid of all binlogs and makes an instance appear like
a fresh single-instance database, without consulting the ring.
We need to block it.

However, under certain circumstances (in the future), e.g. during a
first-time replicaset build, or when we can guarantee that the instance
is a single-node raft ring, we can potentially do a reset master
followed by a re-bootstrap of raft. This can also be achieved by
disabling raft, running reset master, and then re-enabling raft. However,
to keep that door open, I have left a force option and will thread
a mechanism from the plugin to call reset_master(force=true).
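
A hedged sketch of the guard this describes, with hypothetical names: RESET MASTER is refused while raft is enabled unless the plugin-threaded force flag is passed.

```
#include <cstdio>

bool enable_raft_plugin = true;  // stand-in for the server variable

bool reset_master_allowed(bool force) {
  if (enable_raft_plugin && !force) {
    std::fprintf(stderr, "RESET MASTER is blocked in raft mode\n");
    return false;
  }
  // Non-raft mode, or a plugin-driven reset_master(force=true) during a
  // first-time replicaset build / single-node re-bootstrap.
  return true;
}
```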

Reviewed By: li-chi

Differential Revision: D31299214

---------------------------------------------------------------------

Setting topology config in MTR

Summary:
Setting the topology config in MTR so that features that use the
topology config can be tested easily. Setting rpl_raft_skip_smc_updates
to avoid unnecessary calls to SMC (even though we supply a dummy
replicaset name).

Reviewed By: li-chi

Differential Revision: D31543877

---------------------------------------------------------------------

Do not allow sql thread start when raft is doing a stop->{}->start transition

Summary:
This is a port of an existing patch made to 5.6 for Raft.

Raft stops the SQL thread during StopAllWrites.
Then it repoints the binlog files, and during that action the
SQL threads have to remain stopped. This diff blocks SQL thread start by
keeping an atomic bool which can be checked from other functions.

This only applies to raft mode, i.e. enable_raft_plugin = true.
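
A minimal sketch of that latch, with illustrative names: the plugin sets an atomic flag for the whole stop->{}->start transition, and SQL thread start checks it.

```
#include <atomic>

// Set for the full duration of raft's stop -> repoint -> start transition.
std::atomic<bool> raft_transition_in_progress{false};

void raft_stop_all_writes_begin() {
  raft_transition_in_progress.store(true, std::memory_order_release);
}

void raft_start_complete() {
  raft_transition_in_progress.store(false, std::memory_order_release);
}

// Checked before the SQL thread is allowed to start.
bool may_start_sql_thread() {
  return !raft_transition_in_progress.load(std::memory_order_acquire);
}
```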

Reviewed By: li-chi

Differential Revision: D31319499

---------------------------------------------------------------------

Handle printing of FD event generated by slave SQL thread

Summary:
Returning early in the FD::print() function makes mysqlbinlog unable to parse Raft logs on secondaries.

The original commit which added this is d048c0f (P173872135).

To comply with the intent of the original bug fix, we avoid printing the FD event of a relay log as a 'BINLOG' statement.
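
A hedged sketch of the print rule, with a stand-in type (not the real event class): the event is still fully printed, so mysqlbinlog keeps working on raft logs, but an FD event that came from a relay log is not emitted as a 'BINLOG' statement.

```
#include <cstdio>

struct FormatDescriptionEvent {  // stand-in, not the real Log_event class
  bool from_relay_log = false;

  void print() const {
    std::puts("# Format description event header ...");  // always printed
    if (from_relay_log) return;  // but never as a BINLOG statement
    std::puts("BINLOG '...base64 payload...'/*!*/;");
  }
};
```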

Reviewed By: anirbanr-fb

Differential Revision: D26359417

---------------------------------------------------------------------

Add exponential backoff for smart restart

Summary:
RAFT instances can crash-loop and fill up the tx logs.

rocksdb can fill up the tx logs, so
we should stop restarting mysqld if we have already restarted many times in a day.

Reviewed By: anirbanr-fb

Differential Revision: D27372750

---------------------------------------------------------------------

always release data_lock mutex to avoid deadlock

Summary: In a stage-1 replicaset, when a secondary instance is killed, the instance sometimes runs into a deadlock because the process_raft_queue thread forgot to release the mutex it acquired in raft_change_master.

Reviewed By: Pushapgl, bhatvinay

Differential Revision: D27602667

---------------------------------------------------------------------

Gracefully exit mysqld_safe loop during backoff

Summary:
Currently, systemctl stop mysqld@3306.service can take 10 minutes when mysqld_safe is in its backoff period.

D28517599 adds an interrupt to the sleep in mysql_stop, and mysqld_safe immediately breaks out of the retry loop if the sleep is interrupted.
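
A hedged sketch of the interruptible sleep, modeled with a condition variable (the real change lives in the mysqld_safe/mysql_stop shell scripts): a stop request wakes the backoff sleep early, and the retry loop breaks immediately.

```
#include <chrono>
#include <condition_variable>
#include <mutex>

std::mutex m;
std::condition_variable cv;
bool stop_requested = false;

// Returns true if interrupted by a stop request; the retry loop then
// exits right away instead of finishing the full backoff.
bool backoff_sleep(std::chrono::seconds backoff) {
  std::unique_lock<std::mutex> lk(m);
  return cv.wait_for(lk, backoff, [] { return stop_requested; });
}

void request_stop() {  // what `systemctl stop mysqld@3306.service` triggers
  {
    std::lock_guard<std::mutex> lk(m);
    stop_requested = true;
  }
  cv.notify_all();
}
```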

Reviewed By: anirbanr-fb

Differential Revision: D28606439

---------------------------------------------------------------------

Fix heap overflow in group_relay_log_name handling

Summary:
We were accessing group_relay_log_name in
Query_log_event::do_apply_event_worker() but it's assigned only after
the coordinator thread encounters an end event (i.e. xid event or a
query event with "COMMIT" or "ROLLBACK" query). This was causing a race
between accessing group_relay_log_name in the worker thread and writing
it on the coordinator thread. We don't need to set transaction position
in events other than end event, so now we set transaction position in
query event only if it's an end event. The race is eliminated because
group_relay_log_name is set before enqueuing the event to the worker
thread (in both dep repl and vanilla mts).
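
A hedged sketch of the ordering fix, with illustrative types: the coordinator copies the transaction position into the event only when it is an end event, and does so before enqueueing, so a worker never reads the field while the coordinator is still writing it.

```
#include <functional>
#include <string>

struct Event {  // stand-in for Query_log_event
  bool is_end_event = false;  // XID event, or COMMIT/ROLLBACK query event
  std::string trx_pos;        // copy of group_relay_log_name
};

void coordinator_enqueue(Event &ev, const std::string &group_relay_log_name,
                         const std::function<void(Event &)> &enqueue) {
  if (ev.is_end_event) {
    ev.trx_pos = group_relay_log_name;  // published before handoff: no race
  }
  enqueue(ev);  // worker may now read ev.trx_pos safely
}
```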

Reviewed By: lth

Differential Revision: D28767430

---------------------------------------------------------------------

fix memory during MYSQL_BIN_LOG::open_existing_binlog

Summary:
ASan debug builds complain that there are memory leaks during MYSQL_BIN_LOG::open:

Direct leak of 50 byte(s) in 1 object(s) allocated from:
    #0 0x67460ef in malloc
    #1 0x93f0777 in my_raw_malloc(unsigned long, int)
    #2 0x93f064a in my_malloc(unsigned int, unsigned long, int)
    #3 0x93f0eb0 in my_strdup(unsigned int, char const*, int)
    #4 0x8af01a6 in MYSQL_BIN_LOG::open(unsigned int, char const*, char const*, unsigned int)
    #5 0x8af8064 in MYSQL_BIN_LOG::open_binlog(char const*, char const*, unsigned long, bool, bool, bool, Format_description_log_event*, unsigned int, RaftRotateInfo*, bool)
    #6 0x8b00c00 in MYSQL_BIN_LOG::new_file_impl(bool, Format_description_log_event*, RaftRotateInfo*)
    #7 0x8d65e47 in rotate_relay_log(Master_info*, bool, bool, bool, RaftRotateInfo*)
    #8 0x8d661c0 in rotate_relay_log_for_raft(RaftRotateInfo*)
    #9 0x8c7696a in process_raft_queue
    #10 0xa0fa1fd in pfs_spawn_thread(void*)
    #11 0x7f8c9a12b20b in start_thread

Release this memory before reassigning the pointers.

Reviewed By: Pushapgl

Differential Revision: D28819752

---------------------------------------------------------------------

Reduce max backoff time from 24h to 30mins

Summary:
During the recent SEV0, raft instances were crash-looping because of clock skew. Even after the clock was back to normal, mysqld_safe failed to bring back mysqld because of the extended 24-hour backoff.

This diff reduces the max backoff time from 24h to 30 mins. So it will still keep trying, but not so aggressively that it fills up the trx logs.
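
A minimal sketch of the capped exponential backoff after this change (constants illustrative; the real logic lives in mysqld_safe): the wait doubles per consecutive restart and is clamped at 30 minutes instead of the old 24 hours.

```
#include <algorithm>
#include <chrono>

std::chrono::seconds restart_backoff(int consecutive_failures) {
  const std::chrono::seconds base{5};
  const std::chrono::seconds max_backoff{30 * 60};  // was 24h before this diff
  // Clamp the shift so the multiplication cannot overflow.
  const int shift = std::min(consecutive_failures, 20);
  return std::min(base * (1LL << shift), max_backoff);
}
```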

Reviewed By: bart2

Differential Revision: D31661172

---------------------------------------------------------------------

Fix dump thread gtid_purged calculation in raft mode

Summary:
In raft mode, the dump thread is supposed to use gtid_purged_for_tailing, and lost_gtids should point to it.

Because of a bug, the lost_gtids pointer's referent never changes: it always points to gtid_state->get_lost_gtids(), which is gtid_purged, not gtid_purged_for_tailing.
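
A hedged sketch of the fix with stand-in types: the dump thread picks the pointer by mode each time, instead of leaving it bound to gtid_state->get_lost_gtids() (plain gtid_purged).

```
struct GtidSet {};  // stand-in for the server's Gtid_set

GtidSet gtid_purged;              // what gtid_state->get_lost_gtids() returns
GtidSet gtid_purged_for_tailing;  // raft-aware purged set

const GtidSet *lost_gtids_for_dump_thread(bool raft_mode) {
  // Before the fix this effectively always returned &gtid_purged.
  return raft_mode ? &gtid_purged_for_tailing : &gtid_purged;
}
```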

Reviewed By: anirbanr-fb

Differential Revision: D34110214

fbshipit-source-id: 3f53a3a62f0
ldonoso pushed a commit that referenced this pull request Apr 12, 2022
Summary:
Today in `SELECT count(*)` MyRocks would still decode every single column due to this check, despite the readset being empty:

```
// bitmap is cleared on index merge, but it still needs to decode columns
bool field_requested =
    decode_all_fields || m_verify_row_debug_checksums ||
    bitmap_is_set(field_map, m_table->field[i]->field_index);
```
As a result MyRocks is significantly slower than InnoDB in this particular scenario.

It turns out that in index merge, when it tries to reset, it calls ha_index_init with an empty column_bitmap, so our field decoders didn't know they needed to decode anything, and the entire query would return nothing. This is discussed in [this commit](facebook/mysql-5.6@70f2bcd), [issue 624](facebook/mysql-5.6#624), and [PR 626](facebook/mysql-5.6#626). The workaround we had at that time was to simply treat an empty map as implicitly meaning everything, and the side effect is a massively slowed down count(*).

We have a few options to address this:
1. Fix the index merge optimizer - looking at the code in QUICK_RANGE_SELECT::init_ror_merged_scan, it actually fixes up the column_bitmap properly, but only after init/reset, so the fix would simply be moving the bitmap set-up code up. For secondary keys, prepare_for_position will automatically call `mark_columns_used_by_index_no_reset(s->primary_key, read_set)` if HA_PRIMARY_KEY_REQUIRED_FOR_POSITION is set (true for both InnoDB and MyRocks), so we would know correctly that we need to unpack the PK when walking the SK during index merge.
2. Override `column_bitmaps_signal` and set up decoders whenever the bitmap changes - however, this doesn't work by itself. Because no storage engine today actually uses handler::column_bitmaps_signal, this path hasn't been tested properly in index merge. In this case, QUICK_RANGE_SELECT::init_ror_merged_scan should call set_column_bitmaps_no_signal to avoid resetting the correct read/write set of head, since head is used as the first handler (reuses_handler=true) and as the placeholder for subsequent read/write set updates (reuse_handler=false).
3. Follow InnoDB's solution - InnoDB delays actually initializing its template until index_read is entered for the 2nd time (relying on `prebuilt->sql_stat_start`), and by then `QUICK_RANGE_SELECT::column_bitmap` is already fixed up and the table read/write set has been switched to it, so the new template is built correctly.

In order to make it easier to maintain and port, after discussing with Manuel, I'm going with a simplified version of #3 that delays decoder creation until the first read operation (index_*, rnd_*, range_read_*, multi_range_read_*) and sets the delay flag in index_init / rnd_init / multi_range_read_init.
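
A hedged sketch of that simplified scheme, with stand-in names for the ha_rocksdb/Rdb_converter members: the init paths only mark the decoders stale, and the first read operation builds them, by which point the optimizer has installed the final column bitmap.

```
struct HandlerSketch {  // stand-in for ha_rocksdb
  bool m_need_build_decoder = false;

  void index_init() { m_need_build_decoder = true; }  // delay, don't build
  void rnd_init() { m_need_build_decoder = true; }
  void multi_range_read_init() { m_need_build_decoder = true; }

  void index_read() { check_build_decoder(); /* ... actual read ... */ }
  void rnd_next() { check_build_decoder(); /* ... actual read ... */ }

 private:
  void check_build_decoder() {
    if (m_need_build_decoder) {
      build_decoders_from_read_set();  // bitmap is final by the first read
      m_need_build_decoder = false;
    }
  }
  void build_decoders_from_read_set() { /* rebuild field decoders */ }
};
```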

Also, I ran into a bug with truncation_partition where Rdb_converter's tbl_def is stale (we only update ha_rocksdb::m_tbl_def); it was harmless because it was not used after table open. But my change moves the lookup_bitmap initialization into Rdb_converter, which takes a dependency on Rdb_converter::m_tbl_def, so now we need to reset it properly.

Reference Patch: facebook/mysql-5.6@44d6a8d

---------
Porting Note: Due to 8.0's new counting infra (handler::record & handler::record_with_index), this only helps PK counting. Will send out a better fix that integrates with the new 8.0 counting infra.

Reviewed By: Pushapgl

Differential Revision: D26265470
ldonoso pushed a commit that referenced this pull request Apr 29, 2022
…ILER WARNINGS

Remove stringop-truncation warning in ndb_config.cpp by refactoring.

Change-Id: I1eea7fe190926a85502e73ca7ebf07d984af9a09
ldonoso pushed a commit that referenced this pull request Jul 12, 2022
**Problem:**

The test fails under ASAN:

```
==470513==ERROR: AddressSanitizer: heap-use-after-free on address 0x632000054e20 at pc 0x556599b68016 bp 0x7ffc630afb30 sp 0x7ffc630afb20
READ of size 8 at 0x632000054e20 thread T0
    #0 0x556599b68015 in destroy_rwlock(PFS_rwlock*) /tmp/ps/storage/perfschema/pfs_instr.cc:430
    #1 0x556599b30b82 in pfs_destroy_rwlock_v2(PSI_rwlock*) /tmp/ps/storage/perfschema/pfs.cc:2596
    #2 0x7fa44336d62e in inline_mysql_rwlock_destroy /tmp/ps/include/mysql/psi/mysql_rwlock.h:289
    #3 0x7fa44336da39 in vtoken_lock_cleanup::~vtoken_lock_cleanup() /tmp/ps/plugin/version_token/version_token.cc:517
    #4 0x7fa46a7188a6 in __run_exit_handlers /build/glibc-SzIz7B/glibc-2.31/stdlib/exit.c:108
    #5 0x7fa46a718a5f in __GI_exit /build/glibc-SzIz7B/glibc-2.31/stdlib/exit.c:139
    #6 0x556596531da2 in mysqld_exit /tmp/ps/sql/mysqld.cc:2512
    #7 0x55659655d579 in mysqld_main(int, char**) /tmp/ps/sql/mysqld.cc:8505
    #8 0x55659609c5b5 in main /tmp/ps/sql/main.cc:25
    #9 0x7fa46a6f6082 in __libc_start_main ../csu/libc-start.c:308
    #10 0x55659609c4ed in _start (/tmp/results/PS/runtime_output_directory/mysqld+0x3c1b4ed)

0x632000054e20 is located 50720 bytes inside of 90112-byte region [0x632000048800,0x63200005e800)
freed by thread T0 here:
    #0 0x7fa46b5f940f in __interceptor_free ../../../../src/libsanitizer/asan/asan_malloc_linux.cc:122
    #1 0x556599b617eb in pfs_free(PFS_builtin_memory_class*, unsigned long, void*) /tmp/ps/storage/perfschema/pfs_global.cc:113
    #2 0x556599b61a15 in pfs_free_array(PFS_builtin_memory_class*, unsigned long, unsigned long, void*) /tmp/ps/storage/perfschema/pfs_global.cc:177
    #3 0x556599b6f28b in PFS_buffer_default_allocator<PFS_rwlock>::free_array(PFS_buffer_default_array<PFS_rwlock>*) /tmp/ps/storage/perfschema/pfs_buffer_container.h:172
    #4 0x556599b75628 in PFS_buffer_scalable_container<PFS_rwlock, 1024, 1024, PFS_buffer_default_array<PFS_rwlock>, PFS_buffer_default_allocator<PFS_rwlock> >::cleanup() /tmp/ps/storage/perfschema/pfs_buffer_container.h:452
    #5 0x556599b6d591 in cleanup_instruments() /tmp/ps/storage/perfschema/pfs_instr.cc:231
    #6 0x556599b8c3f1 in cleanup_performance_schema /tmp/ps/storage/perfschema/pfs_server.cc:343
    #7 0x556599b8dcfc in shutdown_performance_schema() /tmp/ps/storage/perfschema/pfs_server.cc:374
    #8 0x556596531d96 in mysqld_exit /tmp/ps/sql/mysqld.cc:2500
    #9 0x55659655d579 in mysqld_main(int, char**) /tmp/ps/sql/mysqld.cc:8505
    #10 0x55659609c5b5 in main /tmp/ps/sql/main.cc:25
    #11 0x7fa46a6f6082 in __libc_start_main ../csu/libc-start.c:308

previously allocated by thread T0 here:
    #0 0x7fa46b5fa6e5 in __interceptor_posix_memalign ../../../../src/libsanitizer/asan/asan_malloc_linux.cc:217
    #1 0x556599b6167e in pfs_malloc(PFS_builtin_memory_class*, unsigned long, int) /tmp/ps/storage/perfschema/pfs_global.cc:68
    #2 0x556599b6187a in pfs_malloc_array(PFS_builtin_memory_class*, unsigned long, unsigned long, int) /tmp/ps/storage/perfschema/pfs_global.cc:155
    #3 0x556599b6fa9e in PFS_buffer_default_allocator<PFS_rwlock>::alloc_array(PFS_buffer_default_array<PFS_rwlock>*) /tmp/ps/storage/perfschema/pfs_buffer_container.h:159
    #4 0x556599b6ff12 in PFS_buffer_scalable_container<PFS_rwlock, 1024, 1024, PFS_buffer_default_array<PFS_rwlock>, PFS_buffer_default_allocator<PFS_rwlock> >::allocate(pfs_dirty_state*) /tmp/ps/storage/perfschema/pfs_buffer_container.h:602
    #5 0x556599b69abc in create_rwlock(PFS_rwlock_class*, void const*) /tmp/ps/storage/perfschema/pfs_instr.cc:402
    #6 0x556599b341f5 in pfs_init_rwlock_v2(unsigned int, void const*) /tmp/ps/storage/perfschema/pfs.cc:2578
    #7 0x556599b9487b in inline_mysql_rwlock_init /tmp/ps/include/mysql/psi/mysql_rwlock.h:261
    #8 0x556599b94ba7 in init_pfs_tls_channels_instrumentation() /tmp/ps/storage/perfschema/pfs_tls_channel.cc:209
    #9 0x556599b8ca44 in initialize_performance_schema(PFS_global_param*, PSI_thread_bootstrap**, PSI_mutex_bootstrap**, PSI_rwlock_bootstrap**, PSI_cond_bootstrap**, PSI_file_bootstrap**, PSI_socket_bootstrap**, PSI_table_bootstrap**, PSI_mdl_bootstrap**, PSI_idle_bootstrap**, PSI_stage_bootstrap**, PSI_statement_bootstrap**, PSI_transaction_bootstrap**, PSI_memory_bootstrap**, PSI_error_bootstrap**, PSI_data_lock_bootstrap**, PSI_system_bootstrap**, PSI_tls_channel_bootstrap**) /tmp/ps/storage/perfschema/pfs_server.cc:266
    #10 0x55659655a585 in mysqld_main(int, char**) /tmp/ps/sql/mysqld.cc:7497
    #11 0x55659609c5b5 in main /tmp/ps/sql/main.cc:25
    #12 0x7fa46a6f6082 in __libc_start_main ../csu/libc-start.c:308

SUMMARY: AddressSanitizer: heap-use-after-free /tmp/ps/storage/perfschema/pfs_instr.cc:430 in destroy_rwlock(PFS_rwlock*)
Shadow bytes around the buggy address:
  0x0c6480002970: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
  0x0c6480002980: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
  0x0c6480002990: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
  0x0c64800029a0: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
  0x0c64800029b0: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
=>0x0c64800029c0: fd fd fd fd[fd]fd fd fd fd fd fd fd fd fd fd fd
  0x0c64800029d0: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
  0x0c64800029e0: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
  0x0c64800029f0: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
  0x0c6480002a00: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
  0x0c6480002a10: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
Shadow byte legend (one shadow byte represents 8 application bytes):
  Addressable:           00
  Partially addressable: 01 02 03 04 05 06 07
  Heap left redzone:       fa
  Freed heap region:       fd
  Stack left redzone:      f1
  Stack mid redzone:       f2
  Stack right redzone:     f3
  Stack after return:      f5
  Stack use after scope:   f8
  Global redzone:          f9
  Global init order:       f6
  Poisoned by user:        f7
  Container overflow:      fc
  Array cookie:            ac
  Intra object redzone:    bb
  ASan internal:           fe
  Left alloca redzone:     ca
  Right alloca redzone:    cb
  Shadow gap:              cc
==470513==ABORTING
```

The reason for the error is Percona's change in
5ae4d27, which causes the static
variables of the plugin not to be deallocated.

As a result, `void cleanup_instruments()` is called before
`vtoken_lock_cleanup::~vtoken_lock_cleanup()`, which then finds that
the memory of the object has already been deallocated.

**Solution:**

Do not run the tests under ASAN or Valgrind.
ldonoso pushed a commit that referenced this pull request Jul 13, 2022
**Problem:**

The following leak is detected when running the test
`encryption.upgrade_crypt_data_57_v1`:

```
==388399==ERROR: LeakSanitizer: detected memory leaks

Direct leak of 70 byte(s) in 1 object(s) allocated from:
    #0 0x7f5f87812808 in __interceptor_malloc ../../../../src/libsanitizer/asan/asan_malloc_linux.cc:144
    #1 0x55f098875d2c in ut::detail::malloc(unsigned long) /home/ldonoso/src/release-8.0.29-20/storage/innobase/include/detail/ut/allocator_traits.h:71
    #2 0x55f098875db5 in ut::detail::Alloc_fn::malloc(unsigned long) /home/ldonoso/src/release-8.0.29-20/storage/innobase/include/detail/ut/allocator_traits.h:88
    #3 0x55f0988aa4b9 in void* ut::detail::Alloc_fn::alloc<false>(unsigned long) /home/ldonoso/src/release-8.0.29-20/storage/innobase/include/detail/ut/allocator_traits.h:97
    #4 0x55f09889b7a3 in void* ut::detail::Alloc_pfs::alloc<false>(unsigned long, unsigned int) /home/ldonoso/src/release-8.0.29-20/storage/innobase/include/detail/ut/alloc.h:275
    #5 0x55f09889bb9a in std::enable_if<ut::detail::Alloc_pfs::is_pfs_instrumented_v, void*>::type ut::detail::Alloc_<ut::detail::Alloc_pfs>::alloc<false, ut::detail::Alloc_pfs>(unsigned long, unsigned int) /home/ldonoso/src/release-8.0.29-20/storage/innobase/include/detail/ut/alloc.h:438
    #6 0x55f0988767dd in ut::malloc_withkey(ut::PSI_memory_key_t, unsigned long) /home/ldonoso/src/release-8.0.29-20/storage/innobase/include/ut0new.h:604
    #7 0x55f09937dd3c in rec_copy_prefix_to_buf_old /home/ldonoso/src/release-8.0.29-20/storage/innobase/rem/rem0rec.cc:1206
    #8 0x55f09937dfd3 in rec_copy_prefix_to_buf(unsigned char const*, dict_index_t const*, unsigned long, unsigned char**, unsigned long*) /home/ldonoso/src/release-8.0.29-20/storage/innobase/rem/rem0rec.cc:1233
    #9 0x55f098ae0ae3 in dict_index_copy_rec_order_prefix(dict_index_t const*, unsigned char const*, unsigned long*, unsigned char**, unsigned long*) /home/ldonoso/src/release-8.0.29-20/storage/innobase/dict/dict0dict.cc:3764
    #10 0x55f098c3d0ba in btr_pcur_t::store_position(mtr_t*) /home/ldonoso/src/release-8.0.29-20/storage/innobase/btr/btr0pcur.cc:141
    #11 0x55f098c027b6 in dict_getnext_system_low /home/ldonoso/src/release-8.0.29-20/storage/innobase/dict/dict0load.cc:256
    #12 0x55f098c02933 in dict_getnext_system(btr_pcur_t*, mtr_t*) /home/ldonoso/src/release-8.0.29-20/storage/innobase/dict/dict0load.cc:298
    #13 0x55f098c0c05b in dict_check_sys_tables /home/ldonoso/src/release-8.0.29-20/storage/innobase/dict/dict0load.cc:1573
    #14 0x55f098c1770d in dict_load_tablespaces_for_upgrade() /home/ldonoso/src/release-8.0.29-20/storage/innobase/dict/dict0load.cc:3233
    #15 0x55f0987e9ed1 in innobase_init_files /home/ldonoso/src/release-8.0.29-20/storage/innobase/handler/ha_innodb.cc:6072
    #16 0x55f098819ed3 in innobase_ddse_dict_init /home/ldonoso/src/release-8.0.29-20/storage/innobase/handler/ha_innodb.cc:13985
    #17 0x55f097fa5c10 in dd::bootstrap::DDSE_dict_init(THD*, dict_init_mode_t, unsigned int) /home/ldonoso/src/release-8.0.29-20/sql/dd/impl/bootstrap/bootstrapper.cc:742
    #18 0x55f0986696a6 in dd::upgrade_57::do_pre_checks_and_initialize_dd(THD*) /home/ldonoso/src/release-8.0.29-20/sql/dd/upgrade_57/upgrade.cc:922
    #19 0x55f09550e082 in handle_bootstrap /home/ldonoso/src/release-8.0.29-20/sql/bootstrap.cc:327
    #20 0x55f0997416e7 in pfs_spawn_thread /home/ldonoso/src/release-8.0.29-20/storage/perfschema/pfs.cc:2943
    #21 0x7f5f876a1608 in start_thread /build/glibc-SzIz7B/glibc-2.31/nptl/pthread_create.c:477

SUMMARY: AddressSanitizer: 70 byte(s) leaked in 1 allocation(s).
```

**Solution:**

The cause of the leak arises from the traversal of `pcur`. When
the traversal is exhausted, `pcur.close()` is automatically called and all
`pcur` resources are deallocated.

Percona adds some early returns to the traversal, so sometimes the
traversal is not exhausted and `pcur.close()` is not called.

The solution is to call `pcur.close()` explicitly. `close()` is an
idempotent function, so it is not a bug if it is called several times as
a result of this change.
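
A minimal sketch of the shape of the fix, with a stand-in cursor type (not the real btr_pcur_t): every early-return path closes the cursor explicitly, and because close() is idempotent, the automatic close on exhaustion stays harmless.

```
struct Cursor {  // stand-in for btr_pcur_t
  bool open = false;
  void close() { open = false; }  // idempotent: safe to call repeatedly
};

// Returns false when an early-return condition stops the traversal.
bool traverse(Cursor &pcur, bool (*keep_going)(int record)) {
  pcur.open = true;
  for (int rec = 0; rec < 100; ++rec) {  // toy record loop
    if (!keep_going(rec)) {
      pcur.close();  // the added explicit close on the early-return path
      return false;
    }
  }
  pcur.close();  // also called when the traversal is exhausted
  return true;
}
```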