Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fpm TCP connection stay CLOSE_WAIT when restart mongodb server. #597

Closed
lightjiao opened this issue May 25, 2017 · 10 comments
Closed

fpm TCP connection stay CLOSE_WAIT when restart mongodb server. #597

lightjiao opened this issue May 25, 2017 · 10 comments

Comments

@lightjiao
Copy link

stop or restart mongo server, the fpm connection go into CLOSE_WAIT, but not disconnect.

my problem

It's a connection BUG? Or I used in a wrong way? Or my start&stop scripts has some problem?

Environment

software version
OS CentOS release 6.9
PHP 5.6.22
mongo 3.0.6
PHP-Driver 1.2.9

my PHP was make from source code, not installed through a third-party bundle

some php info

$./bin/php  -i | grep -iE 'mongo|libmongoc|libbson'
mongo
MongoDB Support => enabled
MONGODB-CR => enabled
MONGODB-X509 => enabled
mongo.allow_empty_keys => 0 => 0
mongo.chunk_size => 261120 => 261120
mongo.cmd => $ => $
mongo.default_host => localhost => localhost
mongo.default_port => 27017 => 27017
mongo.is_master_interval => 15 => 15
mongo.long_as_object => 0 => 0
mongo.native_long => 1 => 1
mongo.ping_interval => 5 => 5
mongodb
MongoDB support => enabled
MongoDB extension version => 1.2.9
MongoDB extension stability => stable
libbson bundled version => 1.5.5
libmongoc bundled version => 1.5.5
libmongoc SSL => enabled
libmongoc SSL library => OpenSSL
libmongoc crypto => enabled
libmongoc crypto library => libcrypto
libmongoc crypto system profile => disabled
libmongoc SASL => disabled
mongodb.debug => no value => no value

mongo stop & start script

start
/bin/mongod -f ./conf/mongodb.yaml
stop
./bin/mongod -f ./conf/mongodb.yaml --shutdown
./conf/mongodb.yaml

systemLog:
    destination: file
    path: "/usr/local/easyops/mongodb/log/mongod.log"
    logAppend: true
    timeStampFormat: "ctime"

processManagement:
    fork: true

net:
    port: 27017
    bindIp: "127.0.0.1"
    unixDomainSocket:
       enabled: true
       pathPrefix: "/usr/local/easyops/mongodb/"

storage:
   dbPath: "/usr/local/easyops/mongodb/data"
   journal:
       enabled: true
   engine: "wiredTiger"
   wiredTiger:
       engineConfig:
           cacheSizeGB: 1

security:
    authorization: "enabled"
    # __CLUSTER_KEY_FILE__

when restart mongo server the fpm tcp connection was close_waite

the mongo server stop and start, but the fpm connection stayed in CLOSE_WAIT, my application goes wrong when connect to the mongo server

[root@localhost mongodb]# lsof -i | grep 27017 | grep fpm
php-fpm   18830 easyops    9u  IPv4 4984752      0t0  TCP localhost:35604->localhost:27017 (CLOSE_WAIT)
php-fpm   18830 easyops   10u  IPv4 5171898      0t0  TCP localhost:52240->localhost:27017 (CLOSE_WAIT)
php-fpm   18831 easyops    8u  IPv4 5024369      0t0  TCP localhost:57050->localhost:27017 (CLOSE_WAIT)
php-fpm   18831 easyops   10u  IPv4 5436247      0t0  TCP localhost:54298->localhost:27017 (CLOSE_WAIT)
php-fpm   18832 easyops    9u  IPv4 5024934      0t0  TCP localhost:57300->localhost:27017 (CLOSE_WAIT)
php-fpm   18832 easyops   10u  IPv4 5053247      0t0  TCP localhost:44060->localhost:27017 (CLOSE_WAIT)
php-fpm   18833 easyops    9u  IPv4 4999349      0t0  TCP localhost:43128->localhost:27017 (CLOSE_WAIT)
php-fpm   18833 easyops   10u  IPv4 5069992      0t0  TCP localhost:53424->localhost:27017 (CLOSE_WAIT)
php-fpm   18834 easyops    8u  IPv4 5001182      0t0  TCP localhost:44442->localhost:27017 (CLOSE_WAIT)
php-fpm   18834 easyops   10u  IPv4 5024942      0t0  TCP localhost:57306->localhost:27017 (CLOSE_WAIT)
php-fpm   18835 easyops    9u  IPv4 4984851      0t0  TCP localhost:35690->localhost:27017 (CLOSE_WAIT)
php-fpm   18835 easyops   10u  IPv4 5082463      0t0  TCP localhost:60350->localhost:27017 (CLOSE_WAIT)
php-fpm   18836 easyops    8u  IPv4 5232550      0t0  TCP localhost:56954->localhost:27017 (CLOSE_WAIT)
php-fpm   18836 easyops   10u  IPv4 5181455      0t0  TCP localhost:57380->localhost:27017 (CLOSE_WAIT)
php-fpm   18837 easyops    9u  IPv4 4985527      0t0  TCP localhost:36164->localhost:27017 (CLOSE_WAIT)
php-fpm   18837 easyops   10u  IPv4 5024913      0t0  TCP localhost:57282->localhost:27017 (CLOSE_WAIT)
php-fpm   18838 easyops    8u  IPv4 5001098      0t0  TCP localhost:44380->localhost:27017 (CLOSE_WAIT)
php-fpm   18838 easyops   10u  IPv4 5001117      0t0  TCP localhost:44398->localhost:27017 (CLOSE_WAIT)
php-fpm   18839 easyops    8u  IPv4 5011899      0t0  TCP localhost:50138->localhost:27017 (CLOSE_WAIT)
php-fpm   18839 easyops   10u  IPv4 5082426      0t0  TCP localhost:60338->localhost:27017 (CLOSE_WAIT)
php-fpm   18848 easyops    8u  IPv4 4984828      0t0  TCP localhost:35670->localhost:27017 (CLOSE_WAIT)
php-fpm   18848 easyops   10u  IPv4 5070000      0t0  TCP localhost:53430->localhost:27017 (CLOSE_WAIT)

but when the fpm connection pool reconnect, one of the connection get wrong return and change into ESTABLISHED

php-fpm   23735 easyops    8u  IPv4 6097916      0t0  TCP localhost:42838->localhost:27017 (CLOSE_WAIT)
php-fpm   23736 easyops    7u  IPv4 6107684      0t0  TCP localhost:48132->localhost:27017 (CLOSE_WAIT)
php-fpm   23738 easyops    8u  IPv4 6097725      0t0  TCP localhost:42762->localhost:27017 (CLOSE_WAIT)
php-fpm   23740 easyops    8u  IPv4 6107505      0t0  TCP localhost:48052->localhost:27017 (CLOSE_WAIT)
php-fpm   23741 easyops    8u  IPv4 6097885      0t0  TCP localhost:42818->localhost:27017 (CLOSE_WAIT)
php-fpm   23742 easyops    8u  IPv4 6097873      0t0  TCP localhost:42814->localhost:27017 (CLOSE_WAIT)
php-fpm   23743 easyops    9u  IPv4 6107403      0t0  TCP localhost:47966->localhost:27017 (CLOSE_WAIT)
php-fpm   23744 easyops    9u  IPv4 6107660      0t0  TCP localhost:48110->localhost:27017 (CLOSE_WAIT)
php-fpm   23983 easyops    6u  IPv4 6108343      0t0  TCP localhost:48424->localhost:27017 (ESTABLISHED)

my problem repeat

It's a connection BUG? Or I used in a wrong way? Or my start&stop scripts has some problem?

My English was not very well, thanks reading

@lightjiao
Copy link
Author

any one reply me

@jmikola
Copy link
Member

jmikola commented Jun 1, 2017

Apologies for losing track of this.

but when the fpm connection pool reconnect, one of the connection get wrong return and change into ESTABLISHED

When you say "get wrong return", are you referring to the PHP driver throwing an exception? If a PHP request uses a previously persisted socket, which is now in the CLOSE_WAIT state due to the remote side dropping the connection, I would expect the driver to throw an exception (likely "Stream is closed"). On a subsequent to select that server, the driver would first need to update its topology, which will entail reconnecting an issuing an isMaster command.

This is documented behavior from our server selection and server discovery and monitoring specifications.

There are a few connection options options that you may be able to use to avoid this error:

  • serverSelectionTryOnce: This defaults to true, which eagerly throws an error if the driver cannot select a server (i.e. socket) for an operation. Changing this to false would allow the driver up to serverSelectionTimeoutMS (default: 30 seconds) to find an eligible server, during which period it may continue to monitor servers with isMaster commands.
  • heartbeatFrequencyMS: This defaults to 60 seconds and denotes the interval for the driver to monitor all connected servers. At the expense of more IO, you can lower this as you like down to 500ms (i.e. minHeartbeatFrequencyMS).

Combining serverSelectionTryOnce=false with a lower heartbeatFrequencyMS value will improve your chances for the driver discovering the dropped connection and reconnecting during the server selection process. However, there is always a chance that the driver will select a server based on its view of the topology and will not discover that the connection is dropped until you attempt to use it for an operation. This could be caused by restarting the server between a heartbeat interval (i.e. we selected based on stale information) or between the time we select a server and the operation is executed (i.e. we selected based on accurate information, but the state changed before we could use the socket).


This begs the question: what can we do to mitigate the edge case where we don't discover a dropped connection until attempting an operation on it?

One future improvement to the SDAM spec is to retry failed isMaster requests once to allow the driver a chance to quietly recover from a dropped connection (CDRIVER-2075). That improvement would alleviate similar issues where a remote connection is dropped and the driver does not become aware until it tries to monitor the server. In your case, where both sides of the connection are on the same machine, the OS is always aware that one side has dropped the connection and we see a "Stream is closed" error thrown before the driver attempts isMaster, so CDRIVER-2075 would not apply. Therefore, I just created CDRIVER-2172 to request that we also make one attempt to reestablish a connection before returning a "Stream is closed" error.

@lightjiao
Copy link
Author

Indeed, TCP connection for the CLOSE_WAITE state, PHP dirver should throw Exception, I actually use the process is to capture the Exception.
But I hope that the PHP driver connection status can be timely access to the Mongo server issued by the disconnect operation, making the TCP connection in a timely manner.
As you said, shortening the value of the heartbeatFrequencyMS option is able to refresh the connection state as quickly as possible, but as you said it does not solve the problem perfectly.

In the extreme case, when the TCP connection is not available, I certainly want to make one attempt to reestablish a connection before returning a "Stream is closed" error as you described in CDRIVER-2172.

But now, but how should I solve this problem now?
HeartbeatFrequencyMS is definitely not going to try, I try the serverSelectionTryOnce option to help me.

I use google translate, thanks reading.

@lightjiao
Copy link
Author

In addition, ask what is SDAM? Googlt tells me Severely Deficient Autobi Circle, but I think it should not be the meaning. Can you tell me?

I use google translate, thanks reading.

@jmikola
Copy link
Member

jmikola commented Jun 2, 2017

But now, but how should I solve this problem now? heartbeatFrequencyMS is definitely not going to try, I try the serverSelectionTryOnce option to help me.

I would start with serverSelectionTryOnce=false. If the driver encounters a socket error during monitoring, it will wait five seconds (cooldownMS) before checking the socket again in order to avoid frequently checking a failed socket. Therefore, I would not suggest lowering serverSelectionTimeoutMS below five seconds so that at least two attempts can be made (assuming the first fails, the driver can wait five seconds and then try again).

Beyond that, CDRIVER-2075 and CDRIVER-2172 should allow the driver to retry failed monitoring attempts one time without having to wait for cooldownMS between attempts. I don't have a time estimate for when these improvements will make into the PHP driver. CDRIVER-2075 is currently scheduled for libmongoc 1.8.0. The next minor version of the PHP driver (1.3.0) is expected to use libmongoc 1.7.0 (estimated be released in the coming weeks).

In addition, ask what is SDAM?

This is an acronym for server discovery and monitoring, which is one of the two specs I referenced in my last comment.

@lightjiao
Copy link
Author

Before executing the real Mongo operation, I will execute the following code twice before returning to success.

try {
    $this->client->selectDatabase("admin")->command(["isMaster" => 1])->toArray()[0];
}
catch(\Exception $exception) {
    Log::error("Excute is master failed " . $exception->getMessage());
    $this->code = ReturnCode::DATABASE_CONNECT_FAILED;
    return false;
}

At the same time I also set the parameters below

$uriOptions["serverSelectionTryOnce"] = false;
$uriOptions["serverSelectionTimeoutMS"] = 4000;

But follow-up mongo operation will still return the database connection error after the success of the implementation of isMaster. I don't know how to solve my problem....

@jmikola
Copy link
Member

jmikola commented Jun 2, 2017

In my last comment, I wrote:

Therefore, I would not suggest lowering serverSelectionTimeoutMS below five seconds

You are using 4000 in your connection URI, which will not allow server selection to surpass cooldownMS. This inhibits server selection's ability to monitor a failed socket twice in the same round.

It's not entirely clear to me from your previous comment which errors you were getting, but it sounds like you experienced two exceptions on the same socket. I created a script to test that and discovered what I believe to be a separate bug in libmongoc (CDRIVER-2174). Please read through that issue and experiment with the included test script (modified for your local server, of course) and confirm if that is the same error you're experiencing.

@lightjiao
Copy link
Author

In fact, serverSelectionTimeoutMS this parameter I tried a variety of situations (greater than 5000, less than 5000, equal to 5000 and not set it), but failed to solve my problem. I'm sorry I was not able to express this in the last comment.

$this->client->selectDatabase("admin")->command(["isMaster" => 1])->toArray()[0]; 

This code in my logic is going to checke the connection state of database( with one more check request when checking failed), to ensure that the follow-up database operation success.
But you know, whatever command(["isMaster" => 1]) function success or failure, CLOSE_WAITE state of the connection has no change, dabase operation still failed with ConnectionException.

So come back to the question I want to express in this issue:
I restart mongodb server, PHP fpm connection state turn to CLOSE_WAITE, and then my web server get trouble.

For now, the serverSelectionTryOnce and serverSelectionTimeoutMS parameters are not helpful(CDRIVER-2172);
HeartbeatFrequencyMS can slightly ease the situation, but still did not perfect solution to the problem;
excute command(["isMaster" => 1]) in code logic has no change of result(as well as command["ping" => 1]);

The problem of this database connection to my software had a big bad impact, but also would like to ask if there are some mitigation measures.

I use Google translate, The syntax is not smooth, thanks reading.

@jmikola
Copy link
Member

jmikola commented Jun 5, 2017

This code in my logic is going to checke the connection state of database( with one more check request when checking failed), to ensure that the follow-up database operation success.

If I understand correctly, you are still only retrying the operation once after the first "Stream is closed" exception. In CDRIVER-2174, I demonstrated that the second retry after "Stream is closed" may result in a second exception: "Failed to send "ismaster" command with database "admin": Failure during socket delivery: Broken pipe (32)". Therefore, two retries are necessary for the driver to recover and leave the socket in a usable state.

Did you attempt to reproduce my report in CDRIVER-2174? I'd like to confirm that you understand what I've reported in that issue and the code paths being exercised by the test script I've attached to the issue, as I believe it accurately reflects your issue.

I restart mongodb server, PHP fpm connection state turn to CLOSE_WAITE, and then my web server get trouble.

The problem of this database connection to my software had a big bad impact, but also would like to ask if there are some mitigation measures.

I might suggest restarting FPM after you restart the MongoDB server, which would effectively flush any persisted connections. Apart from that, I believe we've covered all other mitigation measures by discussing the aforementioned connection string options.

In conflusion, there are a number of outstanding issues with libmongoc that must be fixed before the PHP driver can avoid the edge case you're experiencing.

@lightjiao
Copy link
Author

I try the command->(["ping" => 1]) 3 times before real databases opetation to ensure the connection.

there is no problem, thank you very much。

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants