
Make timeouts aggregated #1041

Closed
roji opened this issue Apr 22, 2016 · 31 comments
Comments

@roji
Member

roji commented Apr 22, 2016

Following discussion in #689 (comment), it seems to make sense to make our CommandTimeout behave like SqlClient's (see docs).

This means that all socket read time is aggregated for a single ADO.NET API call (e.g. Read()). Currently we simply set the socket's ReceiveTimeout to the CommandTimeout, which means that if multiple socket reads happen within the same Read(), the timeout is reset each time. It seems to make sense to aggregate the CommandTimeout across all of Read()'s network reads instead.
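The aggregated behavior described above can be sketched outside of .NET. Here is a minimal, hypothetical Python illustration (not Npgsql code): one timeout budget is shared by every socket read a single call performs, so each successive read gets only the remaining time.

```python
import time

def read_until(n_bytes, recv, total_timeout):
    # Hypothetical sketch of the aggregated behavior: one deadline is
    # shared by however many reads are needed to assemble n_bytes,
    # instead of resetting the full timeout on every read.
    deadline = time.monotonic() + total_timeout
    buf = b""
    while len(buf) < n_bytes:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            raise TimeoutError("aggregated command timeout exceeded")
        # each read only gets the shrinking remainder of the budget
        buf += recv(n_bytes - len(buf), timeout=remaining)
    return buf
```

Under the non-aggregated behavior described above (setting the socket's ReceiveTimeout), each `recv` call would instead get the full `total_timeout` again.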

Comments from anyone?

@roji roji self-assigned this Apr 22, 2016
@roji roji added this to the 3.1 milestone Apr 22, 2016
@roji
Member Author

roji commented Apr 28, 2016

Note: check the situation with the connection timeout as well. Are we doing it right?

@roji roji modified the milestones: 3.2, 3.1 May 5, 2016
@roji roji modified the milestone: 3.2 Sep 24, 2016
@realityexists

Should CommandTimeout be the time per read or the time for all reads? I can see a few scenarios:

  1. The user just wants an overall timeout for reading all results. Obviously the SqlClient behaviour is simpler here, but the user could easily implement an overall timeout themselves - just start a Stopwatch and check the time elapsed after each Read().
  2. The user wants a timeout for the first result, but is then happy to take as long as they need reading the results. An overall timeout doesn't work for this, so the current Npgsql behaviour is better.
  3. The user wants a timeout for the first result and a different (usually much shorter) timeout for each subsequent result. Current Npgsql behaviour is better for this, too, as long as it doesn't cache the value of CommandTimeout, so user code could change it after ExecuteReader has returned (and possibly even before each Read).

Compatibility with SqlClient is good, but backwards compatibility with Npgsql 3.1 is better. Combined with the current behaviour being more flexible, I'd say leave it like this.
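The Stopwatch approach from scenario 1 above can be sketched as follows; this is a hypothetical Python illustration of the caller-side overall timeout (`read_row` stands in for `Read()`), not Npgsql code:

```python
import time

def read_all_rows(read_row, overall_timeout):
    # Caller-side overall timeout, as suggested in scenario 1:
    # start a stopwatch, then check elapsed time after each read.
    start = time.monotonic()
    rows = []
    while (row := read_row()) is not None:
        rows.append(row)
        if time.monotonic() - start > overall_timeout:
            raise TimeoutError("overall timeout exceeded after a read")
    return rows
```

Note that this only checks the clock between reads; a single read that blocks forever is still only bounded by the driver's own per-call timeout.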

@roji
Member Author

roji commented Oct 31, 2016

The user just wants an overall timeout for reading all results. Obviously the SqlClient behaviour is simpler here, but the user could easily implement an overall timeout themselves - just start a Stopwatch and check the time elapsed after each Read().

If by "reading all results" you mean all result rows of a command, then this isn't what SqlClient does (or what this issue was proposing to do) - SqlClient applies timeouts to a single ADO.NET method invocation (e.g. Read()), and only aggregates the I/O time spent on that invocation for the purpose of timeouts.

As you say, if the user wants a client-side timeout for reading all results they can implement it themselves.

The user want a timeout for the first result, but then is happy to take as long as they need to reading the results. An overall timeout doesn't work for this, so the current Npgsql behaviour is better.

I definitely don't view this as a very valid/interesting scenario - one of the key points of timeouts is to avoid network issues which would block your program forever. I understand where you're coming from - you're thinking about long-running (or deadlocked) queries rather than network issues, but the latter must be addressed.

One more point - it's not necessarily correct to assume that PostgreSQL takes a long time upfront processing your query, and then spits results back quickly. AFAIK, depending on exactly what you're doing, it may return the first result very quickly but take a long time afterwards.

The user wants a timeout for the first result and a different (usually much shorter) timeout for each subsequent result. Current Npgsql behaviour is better for this, too, as long as it doesn't cache the value of CommandTimeout, so user code could change it after ExecuteReader has returned (and possibly even before each Read).

Even if this issue is implemented and we start aggregating I/O time, this would still be possible as aggregation only happens within a single ADO.NET method call (as I said above). So you would be free to play around with CommandTimeout between calls.

To summarize, this issue actually doesn't change behavior in a very significant way, since it only aggregates within a single method call.

@realityexists

Ah, OK, ignore the above comment then - I misunderstood how it works.

@baal2000
Contributor

+1 for a standard, provider-agnostic behavior.

@roji
Member Author

roji commented Nov 25, 2018

Note that as I wrote in #2239, there's nothing standard about aggregating socket read time - it just happens to be the SqlClient behavior (i.e. not documented on DbCommand.CommandTimeout). I'm not sure what other providers do.

In addition, it isn't very clear whether it really is the desirable behavior for all scenarios. Both behaviors have their pros and cons.

@roji roji changed the title Make frontend timeouts aggregated Make timeouts aggregated Oct 1, 2020
@roji
Member Author

roji commented Oct 1, 2020

@vonzshik you may be interested in this. I am really not sure we should actually implement this - there are pros and cons of both the aggregate and non-aggregate timeout behaviors. But as you've been working on and thinking about timeouts recently, it would be good to have your thoughts.

@roji
Member Author

roji commented Oct 1, 2020

Also /cc @YohDeadfall and @Brar. Since we're redoing the timeout/cancellation story for 5.0, it may be a good idea to at least think about this a bit.

@roji
Member Author

roji commented Oct 1, 2020

Note that I'm not completely sold on the general idea of a cumulative timeout.

The two main uses I see for the timeout in general are to cancel a never-ending query (deadlocked, or simply taking too long because of perf issues), and to handle network partitions; in both those cases, no data arrives at all. As long as data does keep arriving for a query, I'm not sure users would want us to cancel that query just because the total time spent reading exceeded something.

@vonzshik
Contributor

vonzshik commented Oct 1, 2020

As long as data does keep arriving for a query, I'm not sure users would want us to cancel that query just because the total time spent reading exceeded something.

That is the only case I see right now when our current implementation doesn't work very well. The data is trickling down very, very slowly, but not slowly enough for the timeout. If we have a 30-second timeout and 5 TCP reads per DataReader.Read, it might take 145 seconds (29 seconds per TCP read) until something is returned to the user.
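The arithmetic above is easy to check: with per-read (non-aggregated) timeouts, each individual TCP read can take just under the full timeout without ever triggering it.

```python
# Worst case under per-read (non-aggregated) timeouts, per the
# numbers above: each TCP read can finish just inside the timeout.
timeout_s = 30
reads_per_call = 5
worst_case = reads_per_call * (timeout_s - 1)  # each read takes 29 s
print(worst_case)  # → 145 seconds before anything reaches the user
```

With an aggregated timeout, the same call would instead throw after 30 seconds total.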

@roji
Member Author

roji commented Oct 1, 2020

@vonzshik yes, that's the argument in favor of applying CommandTimeout cumulatively (to all TCP reads in Read). But I'm still not convinced users really want to cancel a "trickling" query - we've never really received any feedback on this mechanism; the main use of CommandTimeout is definitely for cases where no data comes at all.

@vonzshik
Contributor

vonzshik commented Oct 1, 2020

As long as that corner case is documented, I see no reason to make timeouts aggregated.

@roji
Member Author

roji commented Oct 1, 2020

I've added a note in npgsql/doc#47 re documentation. Let's see if @YohDeadfall and/or @Brar have other thoughts on aggregation, otherwise I think it's fine to leave this open in the backlog for users to comment on it.

@Brar
Member

Brar commented Oct 1, 2020

it just happens to be the SqlClient behavior (i.e. not documented on DbCommand.CommandTimeout). I'm not sure what other providers do.

If there is some sort of standard or guidance on the ADO.NET level I'd generally vote to adhere to it but I'm not convinced that everything SqlClient does should automatically be assumed as the standard.

But I'm still not convinced users really want to cancel a "trickling" query

Well, they could be. Who knows?
OTOH, if they want to, I'd consider it unsupported on the sync API and really easy on the async API.
Just create your own CancellationTokenSource, set your timeout on it, and pass its CancellationToken to NpgsqlDataReader.ReadAsync() in every iteration.
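That caller-side pattern can be illustrated with an asyncio analogue; in this hypothetical Python sketch (`read_async` stands in for NpgsqlDataReader.ReadAsync, and is not a real Npgsql API), the caller owns one deadline and applies the remaining budget to every read:

```python
import asyncio

async def consume(read_async, overall_timeout):
    # Analogue of the CancellationTokenSource pattern: one deadline,
    # owned by the caller, enforced on every per-row read.
    loop = asyncio.get_running_loop()
    deadline = loop.time() + overall_timeout
    rows = []
    while True:
        remaining = deadline - loop.time()
        # raises asyncio.TimeoutError once the shared budget is spent
        row = await asyncio.wait_for(read_async(), timeout=remaining)
        if row is None:
            return rows
        rows.append(row)
```

As with the CancellationToken approach, this only gives a client-side cutoff; whether the server-side query is cancelled gracefully is a separate question.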

As I understand it, they'd currently break the connection in case of a cancellation as we currently don't support graceful (server side) cancellation on NpgsqlDataReader.ReadAsync() but we could probably change that in the future.

@vonzshik
Contributor

vonzshik commented Oct 1, 2020

As I understand it, they'd currently break the connection in case of a cancellation as we currently don't support graceful (server side) cancellation on NpgsqlDataReader.ReadAsync() but we could probably change that in the future.

Right now, it could be emulated with NpgsqlCommand.Cancel. But with #3191, it also should be possible by cancelling the cancellation token.

@vonzshik
Contributor

vonzshik commented Oct 1, 2020

But NpgsqlDataReader.ReadAsync() is a curious case, as most of the time it returns some data from the buffer and doesn't do any IO.

@Brar
Member

Brar commented Oct 1, 2020

But NpgsqlDataReader.ReadAsync() is a curious case, as most of the time it returns some data from the buffer and doesn't do any IO.

Are you sure?
IIRC we buffer (at most, only in non-sequential mode) one row at a time, so ReadAsync() is supposed to do IO on every request.

@vonzshik
Contributor

vonzshik commented Oct 1, 2020

But NpgsqlDataReader.ReadAsync() is a curious case, as most of the time it returns some data from the buffer and doesn't do any IO.

Are you sure?
IIRC we buffer (at most, only in non-sequential mode) one row at a time, so ReadAsync() is supposed to do IO on every request.

For a query "SELECT * from generate_series(1, 1000000)", it takes around 542 reads (where we do FastRead) until there is some IO.

@Brar
Member

Brar commented Oct 1, 2020

For a query "SELECT * from generate_series(1, 1000000)", it takes around 542 reads (where we do FastRead) until there is some IO.

Ah, yes.
We buffer as much as possible (as much as we get sent from the server in one request and fits in our buffer) but at least one complete row in non-sequential mode.
Thanks!

@Brar
Member

Brar commented Oct 1, 2020

Anyway, as our consumers shouldn't make any assumptions about when ReadAsync() does IO (and for big rows it could happen once per row), they should probably assume it happens on every call (as opposed to GetFieldValue<T>() et al., where you can rely on the fact that no IO happens in non-sequential mode).

@bgrainger
Contributor

FWIW, MySqlConnector tries to implement the same behaviour (aggregate timeouts) as SqlClient. MySQL sends each row in a result set as a single "MySQL packet" (which may be bundled up into a single TCP packet). So unless the rows are huge, the data is usually already there, or retrieved with a single Receive from the Socket. If the rows are very large, multiple socket reads will be required, and the timeout applies to the total of all those operations, until the full row is buffered in memory. (MySqlConnector doesn't support "streaming" with SequentialAccess.)

I think this makes the most sense from a user perspective: if CommandTimeout is 30, MySqlDataReader.Read() will either return true, false, or throw an exception within 30 seconds, no matter how many network reads (0, 1, or many) are required.

@roji
Member Author

roji commented Oct 3, 2020

Re ReadAsync, note that we check the cancellation token up-front regardless of whether we already have a buffered row or not, so @Brar's note above is correct - users can always manage a cancellation token themselves and pass it to all operations.

But the main question here is what we think the driver behavior should actually be - I'm honestly not sure. Ignoring the SqlClient behavior for second, CommandTimeout could mean "total time spent reading", or it could mean "time spent reading without receiving any data". My personal feeling/assumption is that most/all users really need the 2nd (see #1041 (comment)), but implementing the second may also be fine (most users are unlikely to care about the difference). @bgrainger any thoughts on why aggregate timeouts are better?

However, if we do decide to implement this, it's a very good idea to do it for 5.0, along with all the rest of the timeout changes (otherwise we'll have another significant behavioral change later). So I'll move this issue to 5.0 for now - we can either implement it or close it.

@roji roji modified the milestones: Backlog, 5.0.0 Oct 3, 2020
@vonzshik
Contributor

vonzshik commented Oct 3, 2020

I personally prefer our current implementation, as it's actually more flexible. If you don't have any problems with the network, you won't see any difference. If you do have them, we've also got you here.

@roji
Member Author

roji commented Oct 3, 2020

If you don't have any problems with the network, you won't see any difference. If you do have them, we've also got you here.

You're right that if there are network problems, there's no difference.

But if there aren't network problems, I do think there's the question of "trickling" queries, where results come in little by little. Does an average user want to cancel such a query after X seconds, even though data is still coming? We have no data here to answer the question, but my gut tells me the average user would not want to cancel it. And if they do, as @Brar pointed out they can always manage a cancellation token themselves for the duration of the operation.

@vonzshik
Contributor

vonzshik commented Oct 3, 2020

But if there aren't network problems, I do think there's the question of "trickling" queries, where results come in little by little.

Isn't that the definition of a network issue? AFAIK, this isn't PG's behavior.

You're right that if there are network problems, there's no difference.

I'm talking about the case when a query takes too long to complete on the PG side. We will still be stuck waiting for a response from PG, and eventually fail with a timeout.

@roji
Member Author

roji commented Oct 3, 2020

Depending on exactly what the query does, I'm pretty sure it's possible to receive a single row, then wait, then receive another row, then wait, etc. - with perfect network conditions. See Robert Haas's answer here.

I think this is the only relevant scenario for discussion on whether to aggregate or not.

@vonzshik
Contributor

vonzshik commented Oct 3, 2020

Depending on exactly what the query does, I'm pretty sure it's possible to receive a single row, then wait, then receive another row, then wait, etc. - with perfect network conditions. See Robert Haas's answer here.

Wouldn't it happen during DataReader.Read?

@roji
Member Author

roji commented Oct 3, 2020

Yes, that's a very good point. That probably means that within a single Read (or any other single operation), it's not important whether we aggregate or not. The distinction would have been important had we applied CommandTimeout to the entire command, but we don't do that.

I'd like to wait for @bgrainger's input on this - I am interested in whether things are somehow different with MySQL. But otherwise it seems fine to close this.

@bgrainger
Contributor

CommandTimeout could mean "total time spent reading", or it could mean "time spent reading without receiving any data". My personal feeling/assumption is that most/all users really need the 2nd (see #1041 (comment)), but implementing the second may also be fine (most users are unlikely to care about the difference).

"the 2nd ... the second may also" is confusing; not sure which one you meant.

That probably means that within a single Read (or any other single operation), it's not important whether we aggregate or not. The distinction would have been important had we applied CommandTimeout to the entire command, but we don't do that.

I agree with this.

It's hard for me to speak in broad generalities since there are so many varieties of MySQL Server (MySQL, MariaDB, Amazon RDS, Aurora, etc.; v5.6, v5.7, v8.0) but what I've observed is that MySQL tends to take a long time to compute the first row in a result set (this is time in ExecuteReader) then constantly streams rows back. There may be multiple calls to Socket.Receive to read the packets that make up the resultset header, but during Read, assuming the rows are small enough, I would expect each one to be retrieved with one call to Receive.

Unfortunately I don't have any hard data, but I suspect that it generally doesn't matter whether timeouts are aggregated for a single call to ExecuteReader or Read. Perhaps there are pathological cases where Socket.Receive returns just one byte every time, and it's necessary to use aggregated timeouts to prevent extremely long operations. Maybe the SqlClient team chose that approach out of an abundance of caution?

@bgrainger
Contributor

MySqlConnector does implement the aggregate timeout approach but IIRC it's more out of worry that that might be necessary in extreme situations, rather than knowing of particular cases where it's necessary for correctness.

@roji
Member Author

roji commented Oct 3, 2020

@bgrainger thanks for your input.

what I've observed is that MySQL tends to take a long time to compute the first row in a result set (this is time in ExecuteReader) then constantly streams rows back.

On the PG side, we do know that rows may have long delays in between them (#1041 (comment)). However, as @vonzshik correctly pointed out, aggregated timeouts only matter if we think there are going to be delays within the data for a single row, which really does seem very unlikely.

I'm going to close this issue as not needed, until we get some evidence that says otherwise.

@roji roji closed this as completed Oct 3, 2020
@roji roji removed this from the 5.0.0 milestone Oct 3, 2020
@roji roji removed the enhancement label Oct 3, 2020
@roji roji removed their assignment Oct 3, 2020