Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add connection retry interval and support timeout for pvio_socket_internal_connect #204

Closed

Conversation

HugoWenTD
Copy link

Add connection retry interval and support timeout for pvio_socket_internal_connect

There's an infinite loop in function pvio_socket_internal_connect to
retry connection forever if receiving specific errno number:

  • No retry interval in the loop which will cause the process consuming all
    CPU usage and slowdown/stuck other processes.
  • Even if the timeout was set in mysqladmin command option, the function
    will retry forever in the infinite loop.

The issue was seen a few times when MariaDB server was hanging and not
accepting new connection correctly. When the issue happened mysqladmin
process consumed 100% CPU usage and stack trace shows it stuck in the
loop. Though --connect_timeout was set in command line, mysqladmin
will never end until a force kill was executed.

In this commit a retry interval is added in the loop to avoid exhausting
the CPU resource. Also it will check the elapsed time in the loop and
break if it passed the timeout threshold.

All new code of the whole pull request, including one or several files
that are either new files or modified ones, are contributed under the
BSD-new license. I am contributing on behalf of my employer Amazon Web
Services, Inc.

…ternal_connect`

There's an infinite loop in function `pvio_socket_internal_connect` to
retry connection forever if receiving specific errno number:
- No retry interval in the loop which will cause the process consuming all
  CPU usage and slowdown/stuck other processes.
- Even if the timeout was set in `mysqladmin` command option, the function
  will retry forever in the infinite loop.

The issue was seen a few times when MariaDB server was hanging and not
accepting new connection correctly. When the issue happened `mysqladmin`
process consumed 100% CPU usage and stack trace shows it stuck in the
loop. Though `--connect_timeout` was set in command line, `mysqladmin`
will never end until a force kill was executed.

In this commit a retry interval is added in the loop to avoid exhausting
the CPU resource. Also it will check the elapsed time in the loop and
break if it passed the timeout threshold.

All new code of the whole pull request, including one or several files
that are either new files or modified ones, are contributed under the
BSD-new license. I am contributing on behalf of my employer Amazon Web
Services, Inc.
@HugoWenTD
Copy link
Author

HugoWenTD commented Jul 22, 2022

Add the detailed pstack and process list when the issue happened for reference.
Debugged using GDB, the process was stuck in pvio_socket_internal_connect() and kept retrying connect()

[20:49:36][root][~]$ pstack 21189
#0  0x0000150166ec2926 in connect () from /lib64/libpthread.so.0
#1  0x000055f504c87eb8 in pvio_socket_internal_connect (namelen=110, name=0x7ffd81109340, pvio=0x55f5064458d0) at /MariaDB/libmariadb/plugins/pvio/pvio_socket.c:642
#2  pvio_socket_connect_sync_or_async (pvio=pvio@entry=0x55f5064458d0, name=name@entry=0x7ffd81109340, namelen=110) at /MariaDB/libmariadb/plugins/pvio/pvio_socket.c:750
#3  0x000055f504c88473 in pvio_socket_connect (pvio=0x55f5064458d0, cinfo=0x7ffd81109430) at /MariaDB/libmariadb/plugins/pvio/pvio_socket.c:804
#4  0x000055f504c702b5 in mthd_my_real_connect (mysql=0x7ffd8110a0b0, host=<optimized out>, user=0x55f506425558 "root", passwd=0x55f506425588 "xxx", db=0x0, port=0, unix_socket=<optimized out>, client_flag=2147483648) at /MariaDB/libmariadb/libmariadb/mariadb_lib.c:1462
#5  0x000055f504c6d1b4 in mysql_real_connect (mysql=0x7ffd8110a0b0, host=0x0, user=0x55f506425558 "root", passwd=0x55f506425588 "xxx", db=0x0, port=0, unix_socket=0x7ffd8110ade5 "/tmp/mysql.sock", client_flag=2147483648) at /MariaDB/libmariadb/libmariadb/mariadb_lib.c:1301
#6  0x000055f504c69edb in sql_connect (mysql=0x7ffd8110a0b0, wait=0) at /MariaDB/client/mysqladmin.cc:595
#7  0x000055f504c678d2 in main (argc=<optimized out>, argv=<optimized out>) at /MariaDB/client/mysqladmin.cc:453
root    21189 99.8  0.0 128892  3728 ?        R<   20:48   0:45 mysqladmin --no-defaults --user=root --password=x xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx --connect_timeout 2 --socket /tmp/mysql.sock ping

@HugoWenTD HugoWenTD marked this pull request as ready for review July 22, 2022 18:48
@9EOR9 9EOR9 self-assigned this Jul 24, 2022
@9EOR9
Copy link
Collaborator

9EOR9 commented Jul 24, 2022

Thank you for your contribution.

Issue filed: CONC-607

Since it is a bug and not a new feature I will fix it in 3.1 first, then merging to 3.3 (3.2 is discontinued).

@9EOR9
Copy link
Collaborator

9EOR9 commented Jul 24, 2022

pushed into 3.1 and merged into 3.3

commt dfe3563192e43a48bef3a861e72d9d122b9b346c

@9EOR9 9EOR9 closed this Jul 24, 2022
@ottok
Copy link
Contributor

ottok commented Jul 25, 2022

@9EOR9 This was not nice: dfe3563

You should merge the original commit and not infringe on copyright and take ownership of it. If you want it to be rebased on another branch etc, then ask the submitter to rebase. If the submitter is unresponsive (which was not the case here, but could be in another case) then at least cherry-pick the commit and keep authorship. What you now did is a copyright violation.

FYI @grooverdan @ericherman @cvicentiu

@dlenski
Copy link
Contributor

dlenski commented Jul 25, 2022

In addition to what @ottok wrote, @HugoWenTD's original commit message contained a great deal more background information about how and why this change was made; preserving that thorough message in the history would be much more useful to anyone trying to modify or debug this code, e.g. with git blame or git bisect.

@ottok
Copy link
Contributor

ottok commented Aug 9, 2022

@9EOR9 Can you please revert the commit you did wrongly and include the proper commit? Thanks

@ottok
Copy link
Contributor

ottok commented Sep 2, 2022

@LinuxJedi Could you please assist here to rectify the error?

@LinuxJedi
Copy link

@LinuxJedi Could you please assist here to rectify the error?

Hi @ottok, Unfortunately Connector/C is a MariaDB Corporation project (hence the mariadb-corporation GitHub organization for the tree), outside the control of the Foundation. I will, however, try and ping people inside the Corporation to see if I can get you an answer here.

@vuvova
Copy link
Member

vuvova commented Sep 6, 2022

I don't think the commit takes ownership of @HugoWenTD code, on the opposite, it clearly says who the author is.
The original commit message indeed contains a lot more information, but amending the commit comment is not a violation, even if it's unfortunate.

I hope in the future @9EOR9 will use commits from PR as is. But I don't see there's a need to add garbage to the history by reverting and recommitting the same change over and over.

@ottok
Copy link
Contributor

ottok commented Oct 19, 2022

I don't think the commit takes ownership of @HugoWenTD code, on the opposite, it clearly says who the author is.
The original commit message indeed contains a lot more information, but amending the commit comment is not a violation, even if it's unfortunate.

@vuvova Yes, it clearly takes away the authorship. Compare these two options @9EOR9 could have done:

Option 1: Apply original commit and keep authorship and commit message

$ curl -O https://patch-diff.githubusercontent.com/raw/mariadb-corporation/mariadb-connector-c/pull/204.patch
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  2394    0  2394    0     0   3842      0 --:--:-- --:--:-- --:--:--  3836

$ git am 204.patch
Applying: Add connection retry interval and support timeout for `pvio_socket_internal_connect`
23:45:21 2& :~/mariadb/pkg-mariadb-connector-c/upstream ff9c5db* 3s ± git show
commit ff9c5db720d88d16477dccd215b523d686c70f2a (HEAD)
Author: Hugo Wen <wenhug@amazon.com>
Date:   Thu Jul 21 20:29:47 2022 +0000

    Add connection retry interval and support timeout for `pvio_socket_internal_connect`
    
    There's an infinite loop in function `pvio_socket_internal_connect` to
    retry connection forever if receiving specific errno number:
    - No retry interval in the loop which will cause the process consuming all
      CPU usage and slowdown/stuck other processes.
    - Even if the timeout was set in `mysqladmin` command option, the function
      will retry forever in the infinite loop.
    
    The issue was seen a few times when MariaDB server was hanging and not
    accepting new connection correctly. When the issue happened `mysqladmin`
    process consumed 100% CPU usage and stack trace shows it stuck in the
    loop. Though `--connect_timeout` was set in command line, `mysqladmin`
    will never end until a force kill was executed.
    
    In this commit a retry interval is added in the loop to avoid exhausting
    the CPU resource. Also it will check the elapsed time in the loop and
    break if it passed the timeout threshold.
    
    All new code of the whole pull request, including one or several files
    that are either new files or modified ones, are contributed under the
    BSD-new license. I am contributing on behalf of my employer Amazon Web
    Services, Inc.

diff --git a/plugins/pvio/pvio_socket.c b/plugins/pvio/pvio_socket.c
index c4ca16c..5d00a6d 100644
--- a/plugins/pvio/pvio_socket.c
+++ b/plugins/pvio/pvio_socket.c
@@ -627,6 +627,10 @@ static int pvio_socket_internal_connect(MARIADB_PVIO *pvio,
   int rc= 0;
   struct st_pvio_socket *csock= NULL;
   int timeout;
+#ifndef _WIN32
+  unsigned int wait_conn= 1;
+  time_t start_t= time(NULL);
+#endif
 
   if (!pvio || !pvio->data)
     return 1;
@@ -640,6 +644,13 @@ static int pvio_socket_internal_connect(MARIADB_PVIO *pvio,
 #ifndef _WIN32
   do {
     rc= connect(csock->socket, (struct sockaddr*) name, (int)namelen);
+
+    if (time(NULL) - start_t > (time_t)timeout/1000)
+      break;
+
+    usleep(wait_conn);
+    wait_conn= wait_conn >= 1000000 ? 1000000 : wait_conn * 2;
+
   } while (rc == -1 && (errno == EINTR || errno == EAGAIN));
   /* in case a timeout values was set we need to check error values
      EINPROGRESS */

Option 2: Take code, add whitespace and rewrite author and commit message

$ git show dfe3563
commit dfe3563192e43a48bef3a861e72d9d122b9b346c
Author: Georg Richter <georg@mariadb.com>
Date:   Sun Jul 24 17:36:49 2022 +0200

    Fix for CONC-607 (Infinite loop in pvio_socket_internal_connect)
    
    A retry interval is added in the loop to avoid exhausting
    the CPU resource. Also the elapsed time in the loop will be
    checked and break if it passed the timeout threshold.
    
    Kudos to Hugo Wen who reported this issue and provided the fix!

diff --git a/plugins/pvio/pvio_socket.c b/plugins/pvio/pvio_socket.c
index c4ca16c..11ceaba 100644
--- a/plugins/pvio/pvio_socket.c
+++ b/plugins/pvio/pvio_socket.c
@@ -627,6 +627,10 @@ static int pvio_socket_internal_connect(MARIADB_PVIO *pvio,
   int rc= 0;
   struct st_pvio_socket *csock= NULL;
   int timeout;
+#ifndef _WIN32
+  unsigned int wait_conn= 1;
+  time_t start_t= time(NULL);
+#endif
 
   if (!pvio || !pvio->data)
     return 1;
@@ -640,6 +644,13 @@ static int pvio_socket_internal_connect(MARIADB_PVIO *pvio,
 #ifndef _WIN32
   do {
     rc= connect(csock->socket, (struct sockaddr*) name, (int)namelen);
+
+    if (time(NULL) - start_t > (time_t)timeout/1000)
+      break;
+
+    usleep(wait_conn);
+    wait_conn= wait_conn >= 1000000 ? 1000000 : wait_conn * 2;
+
   } while (rc == -1 && (errno == EINTR || errno == EAGAIN));
   /* in case a timeout values was set we need to check error values
      EINPROGRESS */
@@ -1126,3 +1137,4 @@ int pvio_socket_shutdown(MARIADB_PVIO *pvio)
   }
   return -1;
 }
+

Georg's version is a 1:1 copy of Hugo's submission, only one whitespace character is added to the end.

Clearly option 1 would have been better.

We reported this violation to you within 24 hours of Georg making it on July 25th. Additionally we reminded about this open violation again on August 9th and you didn't do anything to rectify it.

I hope in the future @9EOR9 will use commits from PR as is. But I don't see there's a need to add garbage to the history by reverting and recommitting the same change over and over.

We are not asking to do anything 'over and over'. We just asked to fix one problem, once. It is a recurring problem only if you make it several times.

@vuvova
Copy link
Member

vuvova commented Oct 19, 2022

I hope in the future @9EOR9 will use commits from PR as is.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
6 participants