Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

possible concurrency issue when handling auto prepared statements #2482

Open
pgrm opened this issue May 23, 2019 · 20 comments

Comments

@pgrm
Copy link

commented May 23, 2019

Steps to reproduce

still working on this part - https://github.com/pgrm/BoilerplatePostgresExample

The issue

as discussed in #2026 (comment)

We have a situation when 1 auto prepared statement for one connection might get corrupted. The local cache of what has been set as prepared statement, doesn't reflect the postgres information.

The error you can see is Npgsql.PostgresException (0x80004005): 08P01: bind message supplies 2 parameters, but prepared statement "_auto247" requires 14

Generally, the higher the number configured for Max Auto Prepare, the less often we see this error. I can now reliably reproduce it in our end to end tests. When I put the value as low as 50 I see roughly 2-4% of our tests fail at random. It's always different tests. If it were #2026 it should be the same tests failing.

The problem is, we have over a thousand end to end tests, each performing up to several dozen API calls and each of them up to several dozen DB queries. With roughly 40-50 requests performed in parallel. With this load and wide range of queries (the whole product) we get only a tiny fraction of failures. To fail a test, only one of those requests need to fail in only one query, so out of over 100k DB queries, less than 50 failed (0,05% or 0,5‰)

We've attempted to write a simple script - https://github.com/pgrm/BoilerplatePostgresExample but it doesn't work (or it does work and doesn't trigger the error). Clearly if that simple code would have found the error somebody else would have already mentioned it long time before.

@roji how can we attempt to reproduce the variety of those queries that we perform, without spending too much time? Do you think that it might be related to how ASP.Net Core is dealing with the DbContext and therefore we should rather try and reproduce it fully as an api with calls?

Further technical details

Npgsql version: 4.0.7
PostgreSQL version: 11
Operating system: linux (running in docker when those issues happen)

@YohDeadfall

This comment has been minimized.

Copy link
Member

commented May 23, 2019

Is it a duplicate of #2026 where you commented?

@pgrm

This comment has been minimized.

Copy link
Author

commented May 23, 2019

@YohDeadfall no, see the last comment, why I created a new issue - #2026 (comment)

One more thing - when you do have a repro, please open a new issue. This issue is specifically the interaction between auto-preparation and batching ...

I don't have a repro yet, but we've tried to create it and failed horrendously, so I thought to create the issue anyway to ask for advice what else to consider when trying to reproduce the problem

@YohDeadfall YohDeadfall added the bug label May 23, 2019

@pgrm

This comment has been minimized.

Copy link
Author

commented May 24, 2019

FYI - this might be obvious to you guys, just wanted to mention it explicitly. It can also be that it picks a wrong statement, where the number of parameters matches, which then can cause a series of completely weird errors starting from something like Npgsql.PostgresException: 22021: invalid byte sequence for encoding "UTF8": 0x00 to just application issues because the application received wrong data.

@roji

This comment has been minimized.

Copy link
Member

commented May 24, 2019

@roji how can we attempt to reproduce the variety of those queries that we perform, without spending too much time? Do you think that it might be related to how ASP.Net Core is dealing with the DbContext and therefore we should rather try and reproduce it fully as an api with calls?

Concurrency-related issues are always very tricky to repro...

One thing I'd really try to rule out first, is some sort of bug in your own tests, in the sense that the same database connection (or DbContext) is somehow being used concurrently. That's unsupported and cause this kind of corruption.

One way to proceed, is to progressively bisect your test suite to narrow down what's causing it to happen. This would at least efficiently narrow down the amount of tests you have to pay attention too, and ideally may even pinpoint to one test suite (or even test) that has a concurrency bug (if the problem is in your code), or which triggers the Npgsql bug.

Good luck - as soon as we have some sort of repro we'll of course work hard on fixing this...

@roji

This comment has been minimized.

Copy link
Member

commented May 24, 2019

It can also be that it picks a wrong statement, where the number of parameters matches, which then can cause a series of completely weird errors starting from something like Npgsql.PostgresException: 22021: invalid byte sequence for encoding "UTF8": 0x00 to just application issues because the application received wrong data.

This is odd... The error isn't about the application receiving wrong data, but about it sending wrong data (PostgreSQL is complaining about having received an 0x00 within a string from Npgsql). I'm not really sure that this could not be caused by an autoprepare bug (would need to think/analyze more).

@pgrm

This comment has been minimized.

Copy link
Author

commented May 24, 2019

@roji - we're using asp.net core with dependency injection (based on Autofac). And we have static code analyzers to make sure, that every statement is awaited. So really, to what you're suggesting wouldn't that be a bug in autofac or Asp.Net Core?

Basically our end to end test suite is - spinning up the whole system with database in docker, and running actual API calls against it, just like it would look in production, except the load is ~100x

@pgrm pgrm closed this May 24, 2019

@pgrm pgrm reopened this May 24, 2019

@roji

This comment has been minimized.

Copy link
Member

commented May 24, 2019

So really, to what you're suggesting wouldn't that be a bug in autofac or Asp.Net Core?

It could be, but I wouldn't assume that there can't be bugs in your code just because you're using dependency injection and analyzers. For example, so part of your code - or tests - could be spinning up a thread (or sending some code to be executed in the thread pool), and accidentally reusing the same DbContext/connection.

More specifically, I don't know much about autofac, but I really doubt there's an ASP.NET dependency injection bug that would cause this. To be clear, I'm not saying this isn't a bug in Npgsql - it definitely could be. Am just advising to fully rule out an application bug first. Bisecting your test suite could help narrow things down - both in the case of an application bug and in the case of an Npgsql bug.

@pgrm

This comment has been minimized.

Copy link
Author

commented May 24, 2019

Ok thx, we definitely don't use any Thread, Task.Run or Task.Start in our application, just double checked again. The broken endpoints are from run to run always different, but once an endpoint breaks, usually it, or similar endpoints are responsible for all the broken tests - which makes sense, because I'm looking at queries which are more unique, and they might be executed from different, logically grouped, endpoints.

We do leverage ASP.Net Core IHostedService for all our background jobs, where we still use Autofac, but this time as a service locator. - Maybe there is a bug lying in there. Still weird that it would only surface when using auto prepared statements. Because if we were really to have an issue like that, it should also happen from time to time at least, even without using auto prepared statements or? - but that's never the case, without that function turned on, the application is extremely stable, just that we use 50-80% more CPU on the DB

I'm not saying, that we don't have a bug. If max autoprepare is the only parameter that we change, and based on the value we use it can change from making the application fail very often to almost never fail, to actually never fail, it either means that the parameter is a catalyst for a bug that didn't surface yet, or ...? - But I also understand that it might be weird, that we're the only ones complaining about it. Would be nice to know how many other systems use that parameter and how big they are. Ok, will check how the system works without those background services.

btw, now I see also other issues (they still only happen with the max auto prepare set to a small enough value): Received backend message BindComplete while expecting CommandCompleteMessage. Please file a bug. - I assume that can be explained with a concurrency issue as well, or?

@roji

This comment has been minimized.

Copy link
Member

commented May 24, 2019

Thanks for looking into it and also for providing the additional information (especially the 50-80% gap with autoprepare).

I really am not saying there's not a bug in Npgsql's autoprepare feature - it's not the most thoroughly-tested feature in the driver. I do agree that if you never see any errors when autoprepare is off, then there's a good chance the bug is somewhere in there. Unfortunately we have little data on how many people are using it successfully, because we typically don't get many responses saying "it just works well, thanks".

I'd focus on narrowing down the scenario triggering the error, ideally to produce some sort of repro - regardless of whether the problem comes from the Npgsql side, from Autofac/ASP.NET (unlikely) or from the application.

However, can you please confirm you're not using large batching, i.e. batches that are around the same size as your MaxAutoPrepare? It doesn't sound like that's the case from your description, but if so this could simply be another case of #2026.

@McDoit

This comment has been minimized.

Copy link

commented Jun 10, 2019

We have seen similar issues with EF6 and AutoPrepare
Havnt been able to reproduce it as it has only occured 5-10 times over the last year
But might try to lower the MaxAutoPrepare and see if it starts failing more then

@pgrm

This comment has been minimized.

Copy link
Author

commented Jul 7, 2019

we just had an issue, where the prepared statement somehow broke in a very weird way. We have a call which is being executed ~1000 times per hour for that one customer on a node. And ~5% of those calls broke in a way that the prepared statement could be used, but returned somehow wrong data. No idea, what the prepared statement ended up being compared to what it was supposed to be. (How could I figure that out?) - It's probably not possible anymore, now that I restarted the node, to clear the connection pool, but in future it would maybe help understanding this issue.

So far nothing new, however, now that I know which endpoint is using which connection, I was surprised to find out, that the connection was last time used for exactly the same request. There was no other request in between which could have broken it. So basically 2 exact same requests were triggered after each other on the same connection, one worked and the other didn't work anymore. Anything you could deduce from this maybe @roji ? - anything I could try to look up additionally, to better see what might have happened? - maybe we'll need more logs again for the next time this issue might pop up.

@roji

This comment has been minimized.

Copy link
Member

commented Jul 7, 2019

@pgrm it's very difficult to say anything definite, the problem description doesn't include enough concrete information...

And ~5% of those calls broke in a way that the prepared statement could be used, but returned somehow wrong data

That sounds very odd. How do you know that prepared statements are to blame? You're saying no PostgreSQL or Npgsql error was triggered, but incorrect data was somehow returned?

So basically 2 exact same requests were triggered after each other on the same connection, one worked and the other didn't work anymore.

By "didn't work" are you referring to wrong data, or some sort of error?

Basically some sort of concrete information, and ideally a repro of a problem is the way forward here. It you can reproduce the problem by executing the client's call 1000 times (i.e. in an artificial tight loop), and submit that as a sort of minimal test, that would definitely be enough for me to look into it.

@pgrm

This comment has been minimized.

Copy link
Author

commented Jul 8, 2019

How do you know that prepared statements are to blame?

It's of course only circumstantial. This is what I know:

  • Last time we got those errors, was when we turned on prepared statements, but with a much lower number. We ended up turning off prepared statements again, because it happened too often, and we never saw the error after that (until last weekend)
  • I double checked the logs now, and on that one instance:
    • the requests for that connection ID were fine until the first one failed
    • all subsequent requests for that one customer with that one particular postrgres connection ID failed
    • no requests for that one customer with a different postgres conection ID failed
    • other customers were not affected, most likely because we're building some query similar to where x in (1, 2, 3, 4, 5, ... 100, 101) - that in part is not translated to parameters and therefore unique per customer.
  • We definitely do have an in-memory cache for some things, but that one isn't restricted per postgres connection ID, and we don't have anything at all, in our logic, which is shared across requests but limited to a particular postgres connection. That is everything inside npsgsql (or maybe something inside EF Core)

Basically, because we have those where x in (1, 2, 3, 4, 5, ... 100, 101) - which aren't converted to parameters, it's entirely possible, that it's a similar issue as with Npgsql.PostgresException (0x80004005): 08P01: bind message supplies 2 parameters, but prepared statement "_auto247" requires 14 - just that the number of parameters and structure of the result match.

If you can reproduce the problem by executing the client's call 1000 times (i.e. in an artificial tight loop), and submit that as a sort of minimal test, that would definitely be enough for me to look into it.

that's the thing, this endpoint is being called 400k times per day per node (although for different clients) - and it took several months until it broke ones. I really thought, I'd be able to reproduce the error, when we tried last time, but it seems like we're missing something. I had hoped the last time the connection was used before, would have had some error, and because error cases aren't tested that often, that might lead me to the issue, but there was no error on that connection id reported before it started breaking

@roji

This comment has been minimized.

Copy link
Member

commented Jul 8, 2019

I see what you mean. It's conceivable that there's a bug that somehow "switches around" prepared statements. In most cases this would indeed cause the error "bind message supplies X parameter, but prepared statement "_autoXXX" requires Y", but if the parameter number and types happened to correspond, it would silently "work" and invoke the wrong statement.

However, I still am not sure how this could happen, and would probably need some sort of repro. A separate prepared statement cache exists on each physical connection, so the potential for concurrency bugs is limited (unless the user themselves are somehow using the same physical connection concurrently). Beyond that, if you look at the auto-prepare implementation you'll see it's basically a dictionary keyed on the SQL (with an additional check to ensure that parameters match). It's hard to see how such a bug could happen there - but definitely not impossible.

I'll keep this in mind and try to analyze the code again; if anyone else wants to look that could be very helpful - a fresh set of eyes can sometimes do wonders. But a repro is definitely the best way forward here: some sort of console program executing a tight loop that cycles through prepared statements etc.

@roji

This comment has been minimized.

Copy link
Member

commented Jul 8, 2019

One more unrelated note: doing queries with WHERE x IN (....), where the value list is completely variable, is problematic in various ways: it could cause a lot of turnover in the prepared statement cache, hurting perf in a possibly significant way.

It the variability in the number of values really is wide (i.e. sometimes 100 values, sometimes 2), consider rewriting this as WHERE x = ANY (@arrayParameter). This would allow one statement to be used regardless of the number of parameters - but obviously test and benchmark this.

@pgrm

This comment has been minimized.

Copy link
Author

commented Jul 8, 2019

But a repro is definitely the best way forward here: some sort of console program executing a tight loop that cycles through prepared statements etc.

I totally agree, and we did want to build one, spend a few days on it, but it just didn't want to break :( - I'm back at analyzing our code, if we didn't miss some await in some hidden function

consider rewriting this as WHERE x = ANY (@arrayParameter)

We're using EF Core for it, simply .Where(x => items.Contains(x)) - and when we were writing those queries like that, we weren't really considering if it's going to be 5 values inside items or 100. Probably this could be added in future via EF.Functions - Thx for the hint, will keep it in mind.

In case we'll find more, I'll be sure to update this thread :)

@roji

This comment has been minimized.

Copy link
Member

commented Jul 8, 2019

Oh I see - I happened to just open npgsql/Npgsql.EntityFrameworkCore.PostgreSQL#916 to make the Npgsql provider generate ANY() for Contains. Unfortunately that's not going to happen right away though.

Good luck and let me know what you find!

@pgrm

This comment has been minimized.

Copy link
Author

commented Jul 8, 2019

FYI - I've found this great code analyze: https://github.com/ykoksen/unused-task-warning - and I enabled the strict mode, which should also throw errors if tasks are passed down as variables, instead of being avaited directly, and it didn't find any issues (and yes, I was suspicious and added some sample code which it should find, and that worked) ....

I also checked manually the one request, which is usually preceding an issue like ...bind message supplies... - but I guess that was just because it's the most common request (~25-50% of all the requests) so having that as a predecessor is more likely than anything else.

I also checked that we don't use anywhere some Parallel.For or similar and Task.WhenAll or similar - and nothing. There are of course still things I might have missed ... I just wouldn't know what. In 2 weeks I should have some time, and I'll stare at the auto-prepare implementation for a moment.

@pgrm

This comment has been minimized.

Copy link
Author

commented Jul 24, 2019

Just had another occurrence of this. I noticed that the very same request (with one different parameter) was called before as well. So it seems that the counter for the used SQL query went up and the query was tried to be prepared automatically at the next request - but something during the preparation failed? - same old endpoint, which is responsible for half the HTTP requests and ~80% of the load (probably also ~80% of the SQL queries) - I know I'm repeating myself here, just trying to think for myself also looking at that postgres code, where / how it could happen


Update (just for writing this down)

It happened again, I've double checked all the resources tied to that issue, and all of them had an average load. (that is - CPU / Memory on the instance, CPU on DB and number of requests processed by the instance)

@roji

This comment has been minimized.

Copy link
Member

commented Jul 25, 2019

I'd love to have the time to dive into this, but at the moment that simply isn't possible, unless specific information pinpointing the issue can be found... Any help on this (or on analyzing the code) would be great.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
4 participants
You can’t perform that action at this time.