New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cancel all connectors on process exit #1606

Open
mausch opened this Issue Jun 16, 2017 · 14 comments

Comments

Projects
None yet
4 participants
@mausch

mausch commented Jun 16, 2017

Currently, PoolManager clears all ConnectorPools when the process exits, which in turn means closing all connectors.
However it seems that closing a connector does not cancel the backend, can you confirm this?
What I see is that if you press CTRL-C on a process executing a long-running statement, the statement keeps running in Postgres (just as described in #1567 ).

Making Npgsql cancel all connectors on process exit might be a breaking change in the behaviour of Npgsql, so perhaps it would be best to enable this with an optional flag.

Just in case, I'm currently using Npgsql 3.1.9

@roji

This comment has been minimized.

Show comment
Hide comment
@roji

roji Jun 16, 2017

Member

Are you saying that your application process (in which Npgsql is running) has been terminated but you're still seeing queries running on the PostgreSQL? That seems highly unlikely and would be a pretty serious issue on the PostgreSQL side (nothing to do with Npgsql really). Can you please double-check that this is the case?

Member

roji commented Jun 16, 2017

Are you saying that your application process (in which Npgsql is running) has been terminated but you're still seeing queries running on the PostgreSQL? That seems highly unlikely and would be a pretty serious issue on the PostgreSQL side (nothing to do with Npgsql really). Can you please double-check that this is the case?

@mausch

This comment has been minimized.

Show comment
Hide comment
@mausch

mausch Jun 19, 2017

Yes, this is indeed how Postgres works. I had to learn it the hard way...
You can test it by running SELECT pg_sleep(1000); from a console app, pressing CTRL-C and then see in pg_stat_activity how the sleep still keeps running.
Just as described in #1567 , Postgres only checks for the connection status when it attempts to write some results. Closing the connection does not cancel the backend.

mausch commented Jun 19, 2017

Yes, this is indeed how Postgres works. I had to learn it the hard way...
You can test it by running SELECT pg_sleep(1000); from a console app, pressing CTRL-C and then see in pg_stat_activity how the sleep still keeps running.
Just as described in #1567 , Postgres only checks for the connection status when it attempts to write some results. Closing the connection does not cancel the backend.

@roji roji added enhancement and removed waiting for answer labels Jun 19, 2017

@roji roji added this to the 3.3 milestone Jun 19, 2017

@roji

This comment has been minimized.

Show comment
Hide comment
@roji

roji Jun 19, 2017

Member

On second thought, knowing the PostgreSQL architecture it makes sense that things work in this way. I'll try to have this done for 3.3.

Note that right now Npgsql doesn't actually track busy connections internally, only idle ones. I'll need to think about the performance cost of maintain tracking for busy (pooled) connections, exclusively for this feature.

Member

roji commented Jun 19, 2017

On second thought, knowing the PostgreSQL architecture it makes sense that things work in this way. I'll try to have this done for 3.3.

Note that right now Npgsql doesn't actually track busy connections internally, only idle ones. I'll need to think about the performance cost of maintain tracking for busy (pooled) connections, exclusively for this feature.

@ellisonch

This comment has been minimized.

Show comment
Hide comment
@ellisonch

ellisonch Jun 28, 2017

I believe I'm running into the same sort of problem as mausch (although let me know if this belongs on another issue).

Right now, if I Ctrl-C my program while it's in the middle of a long running query, even though the program shuts down, the query keeps running in postgres. This means I have to go over to postgres and kill the query manually. During development, I have to do this over and over and over.

I had naively thought that using using would keep everything tidy, but upon reflection, I'm not sure there's anything Npgsql can do directly, since it doesn't know when the program gets killed (but maybe?). At any rate, I decided to catch the Ctrl-C and make sure everything gets closed tidily. Just to make sure we're communicating effectively ( 😄 ) I've made a basic Npgsql 3.2.4.1 example:

static NpgsqlConnection _conn;
static NpgsqlCommand _command;

static void Main(string[] args) {
	var cstring = _secrets + "CommandTimeout=3600;Pooling=false;ApplicationName=SleeperTest";
	using (_conn = new NpgsqlConnection(cstring)) {
		_conn.Open();
		using (_command = new NpgsqlCommand("select pg_sleep(1000);", _conn)) {
			Console.WriteLine("Sleeping...");
			_command.ExecuteNonQuery();
		}
	}
}

So to reiterate the behavior I see, when I start this program up, the pg_sleep query shows up under postgresql's pg_stat_activity. However, while Ctrl-Cing the program does kill it, it does not stop the query from running, which continues to show up in pg_stat_activity. To get rid of the zombie query, I have to go through the extra step of killing it manually in the db.

So, I tried to address this by capturing the Ctrl-C event so I could make sure to close the connection before the program exits. I try the most obvious thing first by adding

Console.CancelKeyPress += delegate { if (_conn != null) { _conn.Close(); } };

to the top of main. However, although the handler gets called, the Close() resumes running the query and finishes out the program normally! The same thing happens if I use _conn.Dispose(). I did not expect this. I definitely expected that closing a connection would cancel whatever command it was running and, well, close.

Okay, that didn't work. So I try adding instead

Console.CancelKeyPress += delegate { if (_command != null) { _command.Cancel(); } };

This successfully kills the query and shuts down the program immediately afterwards! This is exactly the behavior I want, but it requires I know what command I need to kill when Ctrl-C gets hit. My program might be in the middle of executing any of 20 different commands. I don't see any way of getting the command from the connection, or of telling the connection directly to stop gracefully.

If I'm coming at this from the wrong direction and there's a better way to make sure queries get killed, I'd love to know. Other people must be running into this problem, and if not, I'd like to know how they're avoiding it.

ellisonch commented Jun 28, 2017

I believe I'm running into the same sort of problem as mausch (although let me know if this belongs on another issue).

Right now, if I Ctrl-C my program while it's in the middle of a long running query, even though the program shuts down, the query keeps running in postgres. This means I have to go over to postgres and kill the query manually. During development, I have to do this over and over and over.

I had naively thought that using using would keep everything tidy, but upon reflection, I'm not sure there's anything Npgsql can do directly, since it doesn't know when the program gets killed (but maybe?). At any rate, I decided to catch the Ctrl-C and make sure everything gets closed tidily. Just to make sure we're communicating effectively ( 😄 ) I've made a basic Npgsql 3.2.4.1 example:

static NpgsqlConnection _conn;
static NpgsqlCommand _command;

static void Main(string[] args) {
	var cstring = _secrets + "CommandTimeout=3600;Pooling=false;ApplicationName=SleeperTest";
	using (_conn = new NpgsqlConnection(cstring)) {
		_conn.Open();
		using (_command = new NpgsqlCommand("select pg_sleep(1000);", _conn)) {
			Console.WriteLine("Sleeping...");
			_command.ExecuteNonQuery();
		}
	}
}

So to reiterate the behavior I see, when I start this program up, the pg_sleep query shows up under postgresql's pg_stat_activity. However, while Ctrl-Cing the program does kill it, it does not stop the query from running, which continues to show up in pg_stat_activity. To get rid of the zombie query, I have to go through the extra step of killing it manually in the db.

So, I tried to address this by capturing the Ctrl-C event so I could make sure to close the connection before the program exits. I try the most obvious thing first by adding

Console.CancelKeyPress += delegate { if (_conn != null) { _conn.Close(); } };

to the top of main. However, although the handler gets called, the Close() resumes running the query and finishes out the program normally! The same thing happens if I use _conn.Dispose(). I did not expect this. I definitely expected that closing a connection would cancel whatever command it was running and, well, close.

Okay, that didn't work. So I try adding instead

Console.CancelKeyPress += delegate { if (_command != null) { _command.Cancel(); } };

This successfully kills the query and shuts down the program immediately afterwards! This is exactly the behavior I want, but it requires I know what command I need to kill when Ctrl-C gets hit. My program might be in the middle of executing any of 20 different commands. I don't see any way of getting the command from the connection, or of telling the connection directly to stop gracefully.

If I'm coming at this from the wrong direction and there's a better way to make sure queries get killed, I'd love to know. Other people must be running into this problem, and if not, I'd like to know how they're avoiding it.

@roji

This comment has been minimized.

Show comment
Hide comment
@roji

roji Jun 28, 2017

Member

You understand the issue correctly and you went the right way, even if there's a problem with your proposed solution. Here are a few comments.

Regarding your first attempt (close connection on ^C)... Npgsql doesn't currently support thread-safe closing of a busy connection. You can close a connection which has an open reader (or an ongoing COPY process), but if another thread happens to be using the connection at that instant the behavior is completely undefined. In other words, if you happen to do ^C which a thread is doing something with the connection, things will break in completely unpredictable ways. #1127 and #1162 track providing a safe, reliable way to kill a connection.

Regardless, cancelling all running commands is definitely the way to go - it's a safe mechanism that opens a separate connection and is therefore completely thread-safe. However, ^C is only one way to kill a program - someone might kill the process (e.g. via a GUI or the Linux kill command), or your program might do Environment.Exit - so handling that case specifically isn't a good idea. .NET provides a hook for performing actions when the process shuts down, and Npgsql already uses that to cleanly close idle connections in the connection pool. Something similar could be done for busy connections.

One last note - aside from idle connections in the pool, and connections executing commands, there's also idle connections which the user holds (i.e. connections outside the pool which aren't running any command). These should ideally also be taken care of.

Member

roji commented Jun 28, 2017

You understand the issue correctly and you went the right way, even if there's a problem with your proposed solution. Here are a few comments.

Regarding your first attempt (close connection on ^C)... Npgsql doesn't currently support thread-safe closing of a busy connection. You can close a connection which has an open reader (or an ongoing COPY process), but if another thread happens to be using the connection at that instant the behavior is completely undefined. In other words, if you happen to do ^C which a thread is doing something with the connection, things will break in completely unpredictable ways. #1127 and #1162 track providing a safe, reliable way to kill a connection.

Regardless, cancelling all running commands is definitely the way to go - it's a safe mechanism that opens a separate connection and is therefore completely thread-safe. However, ^C is only one way to kill a program - someone might kill the process (e.g. via a GUI or the Linux kill command), or your program might do Environment.Exit - so handling that case specifically isn't a good idea. .NET provides a hook for performing actions when the process shuts down, and Npgsql already uses that to cleanly close idle connections in the connection pool. Something similar could be done for busy connections.

One last note - aside from idle connections in the pool, and connections executing commands, there's also idle connections which the user holds (i.e. connections outside the pool which aren't running any command). These should ideally also be taken care of.

@ellisonch

This comment has been minimized.

Show comment
Hide comment
@ellisonch

ellisonch Jun 28, 2017

Based on your feedback, I guess until there's something better I'll just keep a big central list of all the commands I generate, and on exit, I'll cancel them all.

FYI, AppDomain.CurrentDomain.DomainUnload and AppDomain.CurrentDomain.ProcessExit do not handle Ctrl-C, at least on some systems (including mine), nor do they handle just closing the console window itself (see e.g., https://stackoverflow.com/a/907123/2877032 and similar).

And of course, there's no way catch program exit on power failure :). I'm not looking for perfection. I know it's impossible. I'm just trying to get rid of one of the steps in my "write code->compile->test->kill program->kill query-> ..." development loop. Thanks for your help!

ellisonch commented Jun 28, 2017

Based on your feedback, I guess until there's something better I'll just keep a big central list of all the commands I generate, and on exit, I'll cancel them all.

FYI, AppDomain.CurrentDomain.DomainUnload and AppDomain.CurrentDomain.ProcessExit do not handle Ctrl-C, at least on some systems (including mine), nor do they handle just closing the console window itself (see e.g., https://stackoverflow.com/a/907123/2877032 and similar).

And of course, there's no way catch program exit on power failure :). I'm not looking for perfection. I know it's impossible. I'm just trying to get rid of one of the steps in my "write code->compile->test->kill program->kill query-> ..." development loop. Thanks for your help!

@roji

This comment has been minimized.

Show comment
Hide comment
@roji

roji Jun 28, 2017

Member

@ellisonch there's a very good chance that Npgsql will take care of this internally soon, either in 3.3 or maybe even in a patch release, of course you're welcome to implement whatever you want until that comes out (or in general).

I'm surprised to learn the process exit domain handlers don't get called in some cases, I'll try to look into it when I start working on this issue.

Member

roji commented Jun 28, 2017

@ellisonch there's a very good chance that Npgsql will take care of this internally soon, either in 3.3 or maybe even in a patch release, of course you're welcome to implement whatever you want until that comes out (or in general).

I'm surprised to learn the process exit domain handlers don't get called in some cases, I'll try to look into it when I start working on this issue.

@roji

This comment has been minimized.

Show comment
Hide comment
@roji

roji Jul 7, 2017

Member

Currently, Npgsql has an event registration on AppDomain.CurrentDomain.DomainUnload and AppDomain.CurrentDomain.ProcessExit, which simply calls NpgsqlConnection.ClearAll(), taking care of idle connections in the pool.

The way I'm thinking of handling this, is to simply track currently-open physical connections globally - whether they're in a pool or otherwise. Since physical open and close are already a heavy operation, an additional bookkeeping operation wouldn't have much performance impact.

When a process shutdown is triggered, we would go over the list of all open physical connections and shut each down according to its state. If it's idle, close it cleanly, like we do today with idle connections in the pool - except this would also take care of idle connections the user is holding. If it's busy, send a cancel on the current command.

Note that there's lots of potential for race conditions as the shutdown hook closes/cancels connections as user code is still executing.

Member

roji commented Jul 7, 2017

Currently, Npgsql has an event registration on AppDomain.CurrentDomain.DomainUnload and AppDomain.CurrentDomain.ProcessExit, which simply calls NpgsqlConnection.ClearAll(), taking care of idle connections in the pool.

The way I'm thinking of handling this, is to simply track currently-open physical connections globally - whether they're in a pool or otherwise. Since physical open and close are already a heavy operation, an additional bookkeeping operation wouldn't have much performance impact.

When a process shutdown is triggered, we would go over the list of all open physical connections and shut each down according to its state. If it's idle, close it cleanly, like we do today with idle connections in the pool - except this would also take care of idle connections the user is holding. If it's busy, send a cancel on the current command.

Note that there's lots of potential for race conditions as the shutdown hook closes/cancels connections as user code is still executing.

@Kikimora

This comment has been minimized.

Show comment
Hide comment
@Kikimora

Kikimora Jul 14, 2018

I see same issue in a slightly different context. My server takes few advisory locks and keeps them locked until it shutdown. If this server force closed by Kubernetis or Ctrl+C then advisory locks keep hanging as if sessions were still active.
Unfortunately I cannot reproduce this reliably. I saw this behaviour just two time - one time on server after node were shutdown due to high CPU usage. Another time this issue happened when test were shutdown by timeout.

Kikimora commented Jul 14, 2018

I see same issue in a slightly different context. My server takes few advisory locks and keeps them locked until it shutdown. If this server force closed by Kubernetis or Ctrl+C then advisory locks keep hanging as if sessions were still active.
Unfortunately I cannot reproduce this reliably. I saw this behaviour just two time - one time on server after node were shutdown due to high CPU usage. Another time this issue happened when test were shutdown by timeout.

@roji

This comment has been minimized.

Show comment
Hide comment
@roji

roji Jul 14, 2018

Member

@Kikimora, the issue you're describing would make sense of your're process is terminated just it's executing a relatively long query: while a query is still executing, PostgreSQL does not check the socket and is unaware that it has been dropped on the client side. If there's no running query PostgreSQL should be aware immediately that the socket had been closed client side, and terminate on it's end, releasing any advisory locks.

It would be good if you could confirm this - it would mean your problem is identical to what has been anniversary in this issue. Otherwise we may have something new.

This kind of thing is hard to reproduce reliably, you may want to try to recreate the situation that happened with your test shutting form by timeout, possibly using a loop to do the same thing over and over until the error triggers.

Member

roji commented Jul 14, 2018

@Kikimora, the issue you're describing would make sense of your're process is terminated just it's executing a relatively long query: while a query is still executing, PostgreSQL does not check the socket and is unaware that it has been dropped on the client side. If there's no running query PostgreSQL should be aware immediately that the socket had been closed client side, and terminate on it's end, releasing any advisory locks.

It would be good if you could confirm this - it would mean your problem is identical to what has been anniversary in this issue. Otherwise we may have something new.

This kind of thing is hard to reproduce reliably, you may want to try to recreate the situation that happened with your test shutting form by timeout, possibly using a loop to do the same thing over and over until the error triggers.

@Kikimora

This comment has been minimized.

Show comment
Hide comment
@Kikimora

Kikimora Jul 16, 2018

@roji We are seeing this quite often, few times a day with team of 10 developers, both in test and production environment. We don't run long queries, we just keep connection opened while process is running. Connection opened at process start, then we acquire few advisory locks with this connection and keep it opened until process terminates. We do this to prevent other similar processes to start by accident or due to mistake. The data is sensitive and two similar running processes could corrupt it hence we protect it with distributed locks.

One way this problem manifest itself in staging and production environment is when server running multiple docker containers get shut down by Azure due to overload. In this case session level advisory locks always stays in the database although docker container with our service already dead.

Kikimora commented Jul 16, 2018

@roji We are seeing this quite often, few times a day with team of 10 developers, both in test and production environment. We don't run long queries, we just keep connection opened while process is running. Connection opened at process start, then we acquire few advisory locks with this connection and keep it opened until process terminates. We do this to prevent other similar processes to start by accident or due to mistake. The data is sensitive and two similar running processes could corrupt it hence we protect it with distributed locks.

One way this problem manifest itself in staging and production environment is when server running multiple docker containers get shut down by Azure due to overload. In this case session level advisory locks always stays in the database although docker container with our service already dead.

@roji

This comment has been minimized.

Show comment
Hide comment
@roji

roji Jul 17, 2018

Member

@Kikimora, the crucial question is here is whether a (long) query is running at the moment when the process goes down. If it is, then it makes sense for the advisory locks to remain taken until that query completes, because PostgreSQL simply isn't aware that the connection has been broken until it finishes processing the query. If, on the other hand, your process is completely idle - simply keeping a connection open with a taken advisory lock - then PostgreSQL should immediately detect that the connection is closed and release the locks. If it isn't, that's probably an issue on the PostgreSQL side. By the way, does the advisory lock remain taken forever? Or does it expire (get released) on its own after some time?

Honestly, I would be surprised if PostgreSQL really had such an issue (not detecting connections breaking while they are idle). The best way forward here would be to demonstrate the issue via a small, independent program: establish a connection, take an advisory lock, then kill the process and show that the lock is still taken some time later.

This issue is about the 1st scenario (locks not released until a long query completes), with the idea of sending PostgreSQL cancellations for all currently-running queries when shutting down. If you're convinced that you're seeing the 2nd issue, then please open a separate issue.

Member

roji commented Jul 17, 2018

@Kikimora, the crucial question is here is whether a (long) query is running at the moment when the process goes down. If it is, then it makes sense for the advisory locks to remain taken until that query completes, because PostgreSQL simply isn't aware that the connection has been broken until it finishes processing the query. If, on the other hand, your process is completely idle - simply keeping a connection open with a taken advisory lock - then PostgreSQL should immediately detect that the connection is closed and release the locks. If it isn't, that's probably an issue on the PostgreSQL side. By the way, does the advisory lock remain taken forever? Or does it expire (get released) on its own after some time?

Honestly, I would be surprised if PostgreSQL really had such an issue (not detecting connections breaking while they are idle). The best way forward here would be to demonstrate the issue via a small, independent program: establish a connection, take an advisory lock, then kill the process and show that the lock is still taken some time later.

This issue is about the 1st scenario (locks not released until a long query completes), with the idea of sending PostgreSQL cancellations for all currently-running queries when shutting down. If you're convinced that you're seeing the 2nd issue, then please open a separate issue.

@Kikimora

This comment has been minimized.

Show comment
Hide comment
@Kikimora

Kikimora Jul 18, 2018

@roji I'll try to make a sample which reliably triggers this behaviour.

There are no queries running at this moment, just an idle connection.

By the way, does the advisory lock remain taken forever? Or does it expire (get released) on its own after some time?

Locks released after some time as far as I can tell. But this time is quite long, maybe 30 minutes, maybe even and hour.

Kikimora commented Jul 18, 2018

@roji I'll try to make a sample which reliably triggers this behaviour.

There are no queries running at this moment, just an idle connection.

By the way, does the advisory lock remain taken forever? Or does it expire (get released) on its own after some time?

Locks released after some time as far as I can tell. But this time is quite long, maybe 30 minutes, maybe even and hour.

@roji

This comment has been minimized.

Show comment
Hide comment
@roji

roji Jul 18, 2018

Member

Thanks. What you're describing is really strange, I hope to see a repro for it and will investigate once I do.

Member

roji commented Jul 18, 2018

Thanks. What you're describing is really strange, I hope to see a repro for it and will investigate once I do.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment