PS-8592: XCom connection stalled forever in read() syscall over network
https://jira.percona.com/browse/PS-8592

Description
-----------
GR suffered from problems caused by security probes and network scanner
processes connecting to the group replication communication port. This usually
is not a problem, but it poses a serious threat when another member tries to
join the cluster by initiating a connection to a member whose group
communication port is being held by such external processes for long
durations.

During such activity by external processes, the SSL-enabled server stalled
forever in the SSL_accept() call, waiting for handshake data. Below is the
stack trace:

    Thread 55 (Thread 0x7f7bb77ff700 (LWP 2198598)):
    #0 in read ()
    #1 in sock_read ()
    #2 in BIO_read ()
    #3 in ssl23_read_bytes ()
    #4 in ssl23_get_client_hello ()
    #5 in ssl23_accept ()
    #6 in xcom_tcp_server_startup(Xcom_network_provider*) ()

When the server stalled in the above path forever, it prevented other members
from joining the cluster, resulting in the following messages in the joiner
server's logs:

    [ERROR] [MY-011640] [Repl] Plugin group_replication reported: 'Timeout on wait for view after joining group'
    [ERROR] [MY-011735] [Repl] Plugin group_replication reported: '[GCS] The member is already leaving or joining a group.'
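
The stall described above can be reproduced in miniature. The following is a minimal Python sketch, not the XCom code: a local socket pair stands in for the group communication port, and the helper name `silent_peer_blocks_reader` is hypothetical. It shows that a read with no timeout stays blocked for as long as a connected peer sends nothing, which is the situation in frame #0 of the stack trace.

```python
import socket
import threading


def silent_peer_blocks_reader(wait_seconds=0.2):
    """Return (still_blocked, data) for a reader facing a silent peer.

    still_blocked is True if the reader was still inside recv() after
    wait_seconds; data is what recv() returned once the peer went away.
    """
    # A connected pair of local sockets: one end plays the server,
    # the other plays a scanner that connects and then says nothing.
    server_side, scanner_side = socket.socketpair()
    result = {}

    def reader():
        # Blocking read with no timeout configured on the socket.
        result["data"] = server_side.recv(1)

    thread = threading.Thread(target=reader, daemon=True)
    thread.start()
    thread.join(wait_seconds)       # give the reader time to enter recv()
    still_blocked = thread.is_alive()

    scanner_side.close()            # the read only returns once the peer closes
    thread.join(1.0)
    server_side.close()
    return still_blocked, result.get("data")
```

In this sketch the reader is released only because the peer eventually closes its end; a scanner that keeps the connection open would hold it indefinitely.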

Solution
--------
This patch adds two new variables

1. group_replication_xcom_ssl_socket_timeout

   It is a file-descriptor-level timeout, in seconds, for both the accept() and
   SSL_accept() calls made while group replication is listening on the XCom
   port. When set to a valid value, for example 5 seconds, both accept() and
   SSL_accept() return after 5 seconds. The default value is 0 (wait
   indefinitely) for backward compatibility. This variable is effective only
   when GR is configured with SSL.

2. group_replication_xcom_ssl_accept_retries

   It defines the number of retries to be performed before closing the socket.
   On each retry, the server thread calls SSL_accept() with the timeout defined
   by group_replication_xcom_ssl_socket_timeout to perform the SSL handshake,
   once the connection has been accepted by the initial accept() call. The
   default value is 10. This variable is effective only when GR is configured
   with SSL.

Note:
- Both of the above variables are dynamically configurable, but will become
  effective only on START GROUP_REPLICATION.
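
How the two variables interact can be sketched as a bounded wait loop. This is an illustrative Python model only, not the patch's C++ code: the function name `wait_for_handshake` is hypothetical, a plain `recv()` stands in for SSL_accept(), and the attempt count (one initial attempt plus `retries` further attempts) is an assumption. That assumption is consistent with the test in this commit, which sets both variables to 3 and expects a log message reporting 12 seconds in total, but the real loop is not shown on this page.

```python
import socket


def wait_for_handshake(conn, timeout_s, retries):
    """Wait for the first handshake byte with a per-attempt timeout.

    Returns (data, attempts). data is None if the peer stayed silent for
    every attempt, in which case the connection is closed.
    """
    # A timeout of 0 models the documented default: wait indefinitely.
    conn.settimeout(timeout_s if timeout_s > 0 else None)
    attempts = 0
    # Assumption: one initial attempt plus `retries` retries.
    for _ in range(retries + 1):
        attempts += 1
        try:
            return conn.recv(1), attempts
        except socket.timeout:
            continue                # nothing arrived in this window, try again
    conn.close()                    # give up so the listener can serve others
    return None, attempts
```

Under that assumption, the worst-case wait on a silent peer is `timeout_s * (retries + 1)`. With a timeout of 0 the loop never times out, which reproduces the original stall.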
venkatesh-prasad-v committed Aug 2, 2023
1 parent 8a7708d commit 257f4d2
Showing 19 changed files with 390 additions and 16 deletions.
10 changes: 9 additions & 1 deletion mysql-test/include/start_proc_in_background.inc
Original file line number Diff line number Diff line change
@@ -19,6 +19,7 @@
# [--let $command_opt = opt1 opt2 ...]
# [--let $output_file = output_file]
# [--let $pid_file = pid_file]
# [--let $redirect_stderr = 0 | 1 ]
# --source include/start_proc_in_backcground.inc
#
# Parameters:
@@ -45,7 +46,14 @@ if (!$command)

if ($output_file)
{
--let $line = $line > $output_file
if ($redirect_stderr == 1)
{
--let $line = $line 2> $output_file
}
if ($redirect_stderr == 0)
{
--let $line = $line > $output_file
}
}

--let $line = $line &
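
The effect of the new `$redirect_stderr` switch on the generated command line can be illustrated with a short Python sketch. The function `build_background_command` is hypothetical, and how the include file assembles `$line` before this hunk is not shown here, so the first line of the function is an assumption; only the redirection branches mirror the diff.

```python
def build_background_command(command, command_opt="", output_file=None,
                             redirect_stderr=0):
    """Build a shell command line that runs `command` in the background."""
    # Assumption: the command and its options are joined with a space.
    line = f"{command} {command_opt}".strip()
    if output_file:
        if redirect_stderr == 1:
            # Capture only stderr in the output file.
            line = f"{line} 2> {output_file}"
        if redirect_stderr == 0:
            # Previous behaviour: capture only stdout.
            line = f"{line} > {output_file}"
    # The trailing ampersand sends the process to the background.
    return f"{line} &"
```

The new test relies on the stderr variant because the mysql client reports the lost connection on stderr, and the test later greps the output file for that error.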
Original file line number Diff line number Diff line change
@@ -84,8 +84,10 @@ SET PERSIST_ONLY group_replication_tls_source = @@GLOBAL.group_replication_tls_s
SET PERSIST_ONLY group_replication_transaction_size_limit = @@GLOBAL.group_replication_transaction_size_limit;
SET PERSIST_ONLY group_replication_unreachable_majority_timeout = @@GLOBAL.group_replication_unreachable_majority_timeout;
SET PERSIST_ONLY group_replication_view_change_uuid = @@GLOBAL.group_replication_view_change_uuid;
SET PERSIST_ONLY group_replication_xcom_ssl_accept_retries = @@GLOBAL.group_replication_xcom_ssl_accept_retries;
SET PERSIST_ONLY group_replication_xcom_ssl_socket_timeout = @@GLOBAL.group_replication_xcom_ssl_socket_timeout;

include/assert.inc ['Expect 63 persisted variables.']
include/assert.inc ['Expect 65 persisted variables.']

############################################################
# 2. Restart server, it must bootstrap the group and preserve
@@ -94,9 +96,9 @@ include/assert.inc ['Expect 63 persisted variables.']
include/rpl_reconnect.inc
include/gr_wait_for_member_state.inc

include/assert.inc ['Expect 63 persisted variables in persisted_variables table.']
include/assert.inc ['Expect 62 variables which last value was set through SET PERSIST.']
include/assert.inc ['Expect 62 persisted variables with matching persisted and global values.']
include/assert.inc ['Expect 65 persisted variables in persisted_variables table.']
include/assert.inc ['Expect 64 variables which last value was set through SET PERSIST.']
include/assert.inc ['Expect 64 persisted variables with matching persisted and global values.']

############################################################
# 3. Test RESET PERSIST IF EXISTS.
@@ -164,6 +166,8 @@ RESET PERSIST IF EXISTS group_replication_tls_source;
RESET PERSIST IF EXISTS group_replication_transaction_size_limit;
RESET PERSIST IF EXISTS group_replication_unreachable_majority_timeout;
RESET PERSIST IF EXISTS group_replication_view_change_uuid;
RESET PERSIST IF EXISTS group_replication_xcom_ssl_accept_retries;
RESET PERSIST IF EXISTS group_replication_xcom_ssl_socket_timeout;

include/assert.inc ['Expect 0 persisted variables.']

12 changes: 8 additions & 4 deletions mysql-test/suite/group_replication/r/gr_persist_variables.result
Original file line number Diff line number Diff line change
@@ -86,8 +86,10 @@ SET PERSIST group_replication_tls_source = @@GLOBAL.group_replication_tls_source
SET PERSIST group_replication_transaction_size_limit = @@GLOBAL.group_replication_transaction_size_limit;
SET PERSIST group_replication_unreachable_majority_timeout = @@GLOBAL.group_replication_unreachable_majority_timeout;
SET PERSIST group_replication_view_change_uuid = @@GLOBAL.group_replication_view_change_uuid;
SET PERSIST group_replication_xcom_ssl_accept_retries = @@GLOBAL.group_replication_xcom_ssl_accept_retries;
SET PERSIST group_replication_xcom_ssl_socket_timeout = @@GLOBAL.group_replication_xcom_ssl_socket_timeout;

include/assert.inc ['Expect 63 persisted variables.']
include/assert.inc ['Expect 65 persisted variables.']

############################################################
# 2. Restart server, it must bootstrap the group and preserve
@@ -96,9 +98,9 @@ include/assert.inc ['Expect 63 persisted variables.']
include/rpl_reconnect.inc
include/gr_wait_for_member_state.inc

include/assert.inc ['Expect 63 persisted variables in persisted_variables table.']
include/assert.inc ['Expect 62 variables which last value was set through SET PERSIST.']
include/assert.inc ['Expect 62 variables which last value was set through SET PERSIST is equal to its global value.']
include/assert.inc ['Expect 65 persisted variables in persisted_variables table.']
include/assert.inc ['Expect 64 variables which last value was set through SET PERSIST.']
include/assert.inc ['Expect 64 variables which last value was set through SET PERSIST is equal to its global value.']

############################################################
# 3. Test RESET PERSIST.
@@ -166,6 +168,8 @@ RESET PERSIST group_replication_tls_source;
RESET PERSIST group_replication_transaction_size_limit;
RESET PERSIST group_replication_unreachable_majority_timeout;
RESET PERSIST group_replication_view_change_uuid;
RESET PERSIST group_replication_xcom_ssl_accept_retries;
RESET PERSIST group_replication_xcom_ssl_socket_timeout;

include/assert.inc ['Expect 0 persisted variables.']

Original file line number Diff line number Diff line change
@@ -44,6 +44,8 @@ WHERE VARIABLE_NAME LIKE 'group_replication_%'
AND VARIABLE_NAME != 'group_replication_auto_evict_timeout'
AND VARIABLE_NAME != 'group_replication_certification_loop_chunk_size'
AND VARIABLE_NAME != 'group_replication_certification_loop_sleep_time'
AND VARIABLE_NAME != 'group_replication_xcom_ssl_socket_timeout'
AND VARIABLE_NAME != 'group_replication_xcom_ssl_accept_retries'
ORDER BY VARIABLE_NAME;
SET SESSION sql_log_bin = 1;
SET @value= @@GLOBAL.group_replication_advertise_recovery_endpoints;
@@ -225,6 +227,10 @@ SET @value= @@GLOBAL.group_replication_tls_source;
SET @@GLOBAL.group_replication_tls_source= @value;
SET @value= @@GLOBAL.group_replication_transaction_size_limit;
SET @@GLOBAL.group_replication_transaction_size_limit= @value;
SET @value= @@GLOBAL.group_replication_xcom_ssl_accept_retries;
SET @@GLOBAL.group_replication_xcom_ssl_accept_retries= @value;
SET @value= @@GLOBAL.group_replication_xcom_ssl_socket_timeout;
SET @@GLOBAL.group_replication_xcom_ssl_socket_timeout= @value;
############################################################
# 5. Validate that we did test all Group Replication options.
[connection server1]
Original file line number Diff line number Diff line change
@@ -7,8 +7,8 @@ Note #### Storing MySQL user name or password information in the connection meta
include/start_and_bootstrap_group_replication.inc
include/stop_group_replication.inc

# Test#1: Basic check that there are 64 GR variables.
include/assert.inc [There are 64 GR variables at present.]
# Test#1: Basic check that there are 66 GR variables.
include/assert.inc [There are 66 GR variables at present.]

# Test#2: Verify group replication related variables at GLOBAL scope.
SET @@SESSION.group_replication_allow_local_lower_version_join= 1;
50 changes: 50 additions & 0 deletions mysql-test/suite/group_replication/r/gr_ssl_socket_timeout.result
Original file line number Diff line number Diff line change
@@ -0,0 +1,50 @@
include/group_replication.inc
Warnings:
Note #### Sending passwords in plain text without SSL/TLS is extremely insecure.
Note #### Storing MySQL user name or password information in the connection metadata repository is not secure and is therefore not recommended. Please consider using the USER and PASSWORD connection options for START REPLICA; see the 'START REPLICA Syntax' in the MySQL Manual for more information.
[connection server1]

############################################################
# 1. Start one member with GCS SSL enabled.
[connection server1]
SET @group_replication_ssl_mode_save= @@GLOBAL.group_replication_ssl_mode;
SET GLOBAL group_replication_ssl_mode= REQUIRED;
SET @group_replication_xcom_ssl_socket_timeout_save= @@GLOBAL.group_replication_xcom_ssl_socket_timeout;
SET @group_replication_xcom_ssl_accept_retries_save= @@GLOBAL.group_replication_xcom_ssl_accept_retries;
SET GLOBAL group_replication_xcom_ssl_socket_timeout= 3;
SET GLOBAL group_replication_xcom_ssl_accept_retries= 3;
include/start_and_bootstrap_group_replication.inc
Occurrences of 'Group communication SSL configuration: group_replication_ssl_mode: "REQUIRED"' in the input file: 1

############################################################
# 2. Start the second member with GCS SSL enabled, the member
# will be able to join the group.
[connection server2]
SET @group_replication_ssl_mode_save= @@GLOBAL.group_replication_ssl_mode;
SET GLOBAL group_replication_ssl_mode= REQUIRED;
include/start_group_replication.inc
include/rpl_gr_wait_for_number_of_members.inc
Occurrences of 'Group communication SSL configuration: group_replication_ssl_mode: "REQUIRED"' in the input file: 1

############################################################
# 3. Verify that any connection on group_replication
# communication port is aborted by the server after the
# timout configured by the group_replication_xcom_ssl_socket_timeout.
include/stop_group_replication.inc
SET @group_replication_communication_debug_options_save = @@GLOBAL.group_replication_communication_debug_options;
SET GLOBAL group_replication_communication_debug_options= "XCOM_DEBUG_BASIC";
START GROUP_REPLICATION;
SET @@GLOBAL.group_replication_communication_debug_options= @group_replication_communication_debug_options_save;
include/assert_grep.inc [Assert that the mysql connection has been ended by the server]
include/assert_grep.inc [Assert that message about aborting the connection has been logged to GCS_DEBUG_TRACE file]
ERROR 2013 (HY000): Lost connection to MySQL server at 'reading initial communication packet', system error: 2

############################################################
# 4. Clean up.
[connection server1]
SET GLOBAL group_replication_ssl_mode= @group_replication_ssl_mode_save;
SET GLOBAL group_replication_xcom_ssl_socket_timeout= @group_replication_xcom_ssl_socket_timeout_save;
SET GLOBAL group_replication_xcom_ssl_accept_retries= @group_replication_xcom_ssl_accept_retries_save;
[connection server2]
SET GLOBAL group_replication_ssl_mode= @group_replication_ssl_mode_save;
include/group_replication_end.inc
Original file line number Diff line number Diff line change
@@ -26,9 +26,9 @@ include/stop_group_replication.inc
#
# Test Unit#1
# Set global/session group replication variables to default.
# Curently there are 64 group replication variables.
# Curently there are 66 group replication variables.
#
include/assert.inc [There are 64 GR variables at present.]
include/assert.inc [There are 66 GR variables at present.]
SET @@GLOBAL.group_replication_auto_increment_increment= default;
ERROR 42000: Variable 'group_replication_auto_increment_increment' can't be set to the value of 'DEFAULT'
SET @@GLOBAL.group_replication_compression_threshold= default;
@@ -84,6 +84,8 @@ SET @@GLOBAL.group_replication_advertise_recovery_endpoints = default;
SET @@GLOBAL.group_replication_view_change_uuid= default;
SET @@GLOBAL.group_replication_communication_stack = default;
SET @@GLOBAL.group_replication_paxos_single_leader = default;
SET @@GLOBAL.group_replication_xcom_ssl_socket_timeout = default;
SET @@GLOBAL.group_replication_xcom_ssl_accept_retries = default;
SET @@SESSION.group_replication_consistency= default;
#
# Test Unit#2
@@ -134,6 +136,8 @@ include/assert.inc [Default group_replication_advertise_recovery_endpoints is "D
include/assert.inc [Default group_replication_view_change_uuid is "AUTOMATIC"]
include/assert.inc [Default group_replication_communication_stack is XCom]
include/assert.inc [Default group_replication_paxos_single_leader is 0]
include/assert.inc [Default group_replication_xcom_ssl_socket_timeout is 0]
include/assert.inc [Default group_replication_xcom_ssl_accept_retries is 10]
#
# Clean up
#
Original file line number Diff line number Diff line change
@@ -153,6 +153,10 @@ SET GLOBAL group_replication_unreachable_majority_timeout = @@GLOBAL.group_repli
ERROR 42000: Access denied; you need (at least one of) the SUPER or SYSTEM_VARIABLES_ADMIN privilege(s) for this operation
SET GLOBAL group_replication_view_change_uuid = @@GLOBAL.group_replication_view_change_uuid;
ERROR 42000: Access denied; you need (at least one of) the SUPER or SYSTEM_VARIABLES_ADMIN privilege(s) for this operation
SET GLOBAL group_replication_xcom_ssl_accept_retries = @@GLOBAL.group_replication_xcom_ssl_accept_retries;
ERROR 42000: Access denied; you need (at least one of) the SUPER or SYSTEM_VARIABLES_ADMIN privilege(s) for this operation
SET GLOBAL group_replication_xcom_ssl_socket_timeout = @@GLOBAL.group_replication_xcom_ssl_socket_timeout;
ERROR 42000: Access denied; you need (at least one of) the SUPER or SYSTEM_VARIABLES_ADMIN privilege(s) for this operation

# Like most system variables, setting the session value for
# group_replication_consistency requires no special privileges.
@@ -234,6 +238,8 @@ SET GLOBAL group_replication_tls_source = @@GLOBAL.group_replication_tls_source;
SET GLOBAL group_replication_transaction_size_limit = @@GLOBAL.group_replication_transaction_size_limit;
SET GLOBAL group_replication_unreachable_majority_timeout = @@GLOBAL.group_replication_unreachable_majority_timeout;
SET GLOBAL group_replication_view_change_uuid = @@GLOBAL.group_replication_view_change_uuid;
SET GLOBAL group_replication_xcom_ssl_accept_retries = @@GLOBAL.group_replication_xcom_ssl_accept_retries;
SET GLOBAL group_replication_xcom_ssl_socket_timeout = @@GLOBAL.group_replication_xcom_ssl_socket_timeout;

############################################################
# 4. Grant GROUP_REPLICATION_ADMIN and verify setting
Original file line number Diff line number Diff line change
@@ -77,6 +77,8 @@ INSERT INTO gr_options_that_cannot_be_change (name)
AND VARIABLE_NAME != 'group_replication_auto_evict_timeout'
AND VARIABLE_NAME != 'group_replication_certification_loop_chunk_size'
AND VARIABLE_NAME != 'group_replication_certification_loop_sleep_time'
AND VARIABLE_NAME != 'group_replication_xcom_ssl_socket_timeout'
AND VARIABLE_NAME != 'group_replication_xcom_ssl_accept_retries'
ORDER BY VARIABLE_NAME;
SET SESSION sql_log_bin = 1;
--let $gr_options_that_cannot_be_change_count= `SELECT COUNT(*) FROM gr_options_that_cannot_be_change;`
Original file line number Diff line number Diff line change
@@ -30,7 +30,7 @@
--source include/start_and_bootstrap_group_replication.inc
--source include/stop_group_replication.inc

--let $gr_var_count= 64
--let $gr_var_count= 66

--echo
--echo # Test#1: Basic check that there are $gr_var_count GR variables.
130 changes: 130 additions & 0 deletions mysql-test/suite/group_replication/t/gr_ssl_socket_timeout.test
Original file line number Diff line number Diff line change
@@ -0,0 +1,130 @@
################################################################################
# This test verifies that any unintended connection on group_replication
# communication port is aborted by the server after the timout configured by
# the group_replication_xcom_ssl_socket_timeout.
#
# Test:
# 0. The test requires two servers: M1 and M2.
# 1. Enable group_replication_ssl_mode = REQUIRED on both members and start GR.
# 2. With both members ONLINE, stop GR on M2.
# 3. Initiate a connection on the GR communication port of M1 as a background
# process.
# 4. Start GR on M2.
# 5. Verify that START GR will be successful, after the server aborting the
# connection.
# 6. Cleanup
################################################################################

--source include/have_group_replication_xcom_communication_stack.inc
--source include/have_group_replication_plugin.inc
--let $rpl_skip_group_replication_start= 1
--source include/group_replication.inc


--echo
--echo ############################################################
--echo # 1. Start one member with GCS SSL enabled.
--let $rpl_connection_name= server1
--source include/rpl_connection.inc
SET @group_replication_ssl_mode_save= @@GLOBAL.group_replication_ssl_mode;
SET GLOBAL group_replication_ssl_mode= REQUIRED;

# Set the group_replication_xcom_ssl_socket_timeout and group_replication_xcom_ssl_accept_retries
SET @group_replication_xcom_ssl_socket_timeout_save= @@GLOBAL.group_replication_xcom_ssl_socket_timeout;
SET @group_replication_xcom_ssl_accept_retries_save= @@GLOBAL.group_replication_xcom_ssl_accept_retries;

SET GLOBAL group_replication_xcom_ssl_socket_timeout= 3;
SET GLOBAL group_replication_xcom_ssl_accept_retries= 3;

# Bootstrap and start group replication
--source include/start_and_bootstrap_group_replication.inc

# Verify that GR was started with group_replication_ssl_mode = REQUIRED
--let $grep_file= $MYSQLTEST_VARDIR/log/mysqld.1.err
--let $grep_pattern= Group communication SSL configuration: group_replication_ssl_mode: "REQUIRED"
--let $grep_output= print_count
--source include/grep_pattern.inc

--echo
--echo ############################################################
--echo # 2. Start the second member with GCS SSL enabled, the member
--echo # will be able to join the group.
--let $rpl_connection_name= server2
--source include/rpl_connection.inc
--disable_query_log
--eval SET GLOBAL group_replication_group_name= '$group_replication_group_name'
--enable_query_log

SET @group_replication_ssl_mode_save= @@GLOBAL.group_replication_ssl_mode;
SET GLOBAL group_replication_ssl_mode= REQUIRED;
--source include/start_group_replication.inc

--let $group_replication_number_of_members= 2
--source include/gr_wait_for_number_of_members.inc

--let $grep_file= $MYSQLTEST_VARDIR/log/mysqld.2.err
--let $grep_pattern= Group communication SSL configuration: group_replication_ssl_mode: "REQUIRED"
--let $grep_output= print_count
--source include/grep_pattern.inc

--echo
--echo ############################################################
--echo # 3. Verify that any connection on group_replication
--echo # communication port is aborted by the server after the
--echo # timout configured by the group_replication_xcom_ssl_socket_timeout.

# STOP GR on server2
--source include/stop_group_replication.inc

# Connect to GR communication port on server1. For the purpose of testing, we
# use mysql client here.
--connection server1
SET @group_replication_communication_debug_options_save = @@GLOBAL.group_replication_communication_debug_options;
SET GLOBAL group_replication_communication_debug_options= "XCOM_DEBUG_BASIC";
--let $gr_port= `SELECT SUBSTRING(@@group_replication_local_address, LOCATE(':',@@group_replication_local_address) + 1)`
--let $command= $MYSQL
--let $command_opt= --user=root --host=127.0.0.1 --port=$gr_port
--let $output_file= $MYSQLTEST_VARDIR/tmp/mysql_output
--let $pid_file= $MYSQLTEST_VARDIR/tmp/mysql_pid
--let $redirect_stderr= 1
--source include/start_proc_in_background.inc

--connection server2
START GROUP_REPLICATION;

--connection server1
SET @@GLOBAL.group_replication_communication_debug_options= @group_replication_communication_debug_options_save;
--source include/wait_proc_to_finish.inc

# Assert that mysql command has failed
--let $assert_text= Assert that the mysql connection has been ended by the server
--let $assert_select= Lost connection to MySQL server at \'reading initial communication packet\'
--let $assert_file= $output_file
--let $assert_count= 1
--source include/assert_grep.inc

# Assert that message about aborting the connection has been logged to GCS_DEBUG_TRACE file
--let $assert_text= Assert that message about aborting the connection has been logged to GCS_DEBUG_TRACE file
--let $assert_select= SSL_accept did receive any data on fd .* despite waiting for 12 seconds in total, aborting the connection.
--let $assert_file= $MYSQLTEST_VARDIR/mysqld.1/data/GCS_DEBUG_TRACE
--let $assert_count= 1
--source include/assert_grep.inc
--exec cat $output_file

--echo
--echo ############################################################
--echo # 4. Clean up.
--let $rpl_connection_name= server1
--source include/rpl_connection.inc
SET GLOBAL group_replication_ssl_mode= @group_replication_ssl_mode_save;
SET GLOBAL group_replication_xcom_ssl_socket_timeout= @group_replication_xcom_ssl_socket_timeout_save;
SET GLOBAL group_replication_xcom_ssl_accept_retries= @group_replication_xcom_ssl_accept_retries_save;

--let $rpl_connection_name= server2
--source include/rpl_connection.inc
SET GLOBAL group_replication_ssl_mode= @group_replication_ssl_mode_save;

--remove_file $pid_file
--remove_file $output_file
--remove_file $MYSQLTEST_VARDIR/mysqld.1/data/GCS_DEBUG_TRACE
--source include/group_replication_end.inc
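
The test derives the XCom port from `group_replication_local_address` with `SUBSTRING(..., LOCATE(':', ...) + 1)`, that is, everything after the first colon. A minimal Python equivalent is sketched below for readers who want to check the expression's behaviour; the function name `gr_port_from_local_address` and the sample addresses are illustrative, not taken from the test suite.

```python
def gr_port_from_local_address(local_address):
    """Return the text after the first ':' in a host:port address."""
    # str.find returns -1 when there is no colon, so the slice starts at 0
    # and the whole string is returned. This matches the SQL expression,
    # where LOCATE yields 0 and SUBSTRING(addr, 1) is the whole string.
    idx = local_address.find(":")
    return local_address[idx + 1:]
```

Because it splits on the first colon, the expression only suits `host:port` values with an IPv4 address or hostname, as used in this test; a bracketed IPv6 address would not yield a port.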