Investigate one-pass parameter writing #1727
This was partially implemented in https://github.com/roji/Npgsql/tree/one-pass, but the perf improvements were less than hoped. Perhaps revisit again in the future.
Opening this to revisit. Have just implemented support for System.Text.Json, where not having this is particularly sad. The new JSON serializer can write UTF8 directly to our binary buffer; however, we currently do a first pass to calculate length, which obliges us to preserialize just to get the length... With one-pass writing we could leave 4 empty bytes, serialize, and then go back and write the length. The question is of course what happens if the buffer isn't large enough to serialize into; we could fall back to two-pass writing (which would mean maintaining both code paths), allocating (or renting) a bigger buffer, or switching to pipelines (which would be very similar).
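A minimal sketch of that idea, assuming a standalone MemoryStream in place of Npgsql's internal write buffer (the OnePassJsonWrite name and the buffer handling are purely illustrative, not the real implementation): reserve 4 bytes, let Utf8JsonWriter serialize straight into the buffer, then go back and backfill the length.

```csharp
using System;
using System.Buffers.Binary;
using System.IO;
using System.Text.Json;

static class OnePassJsonWrite
{
    // Serializes 'value' as UTF-8 JSON with a 4-byte big-endian length prefix,
    // writing the JSON first and going back to fill in the length afterwards.
    public static ArraySegment<byte> Write<T>(T value)
    {
        var stream = new MemoryStream();

        // Leave 4 empty bytes for the length prefix - we don't know the length yet.
        stream.SetLength(sizeof(int));
        stream.Position = sizeof(int);

        // Serialize UTF-8 directly after the placeholder - no intermediate string,
        // and no separate pass just to compute the length.
        using (var writer = new Utf8JsonWriter(stream))
            JsonSerializer.Serialize(writer, value);

        // Go back and write the payload length into the reserved bytes.
        var buffer = stream.GetBuffer();
        BinaryPrimitives.WriteInt32BigEndian(buffer, (int)stream.Length - sizeof(int));

        return new ArraySegment<byte>(buffer, 0, (int)stream.Length);
    }
}
```

The failure mode discussed above is exactly what this sketch sidesteps: a MemoryStream can always grow, whereas a fixed-size write buffer can't, which is where the fall-back/renting/pipelines options come in.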
A high-level idea is the following, for example for writing a ...: first leave space for the length markers, then start using ...
@Emill yeah, this is exactly what I had in mind (and even half-implemented at some point a long time ago). I agree this should be pretty far down on our list. We could maybe do a hacked-up experiment to measure what a proper implementation could provide in terms of perf...
As a reference, note that the PostgreSQL backend seems to have the philosophy that "memcpy and allocations are cheap" rather than having complex code that traverses the data multiple times to avoid copying and memory allocations. For example, consider a jsonb value that is sent from the backend as a DataRow message. The data is copied multiple times:
So in total, three memcpys. On the other hand, Npgsql usually does a very good job of writing directly to the buffer, avoiding all those memcpys. I think this is very cool and probably helps performance quite a lot, especially by reducing pressure on the GC. The downside is of course that the implementation gets a bit more complex.
Yeah, it's interesting... I've always tried to keep Npgsql as "zero-copy" as possible - there are still some specific areas where we can improve things (especially around JSON IIRC). But note that one-pass isn't really about reducing allocations - just about avoiding traversing the same data more than once. I really don't have a good grasp of how much that could matter for perf, though.
Expanding the length cache to also be able to contain the serialized JSON string would probably be a not-too-intrusive change that helps avoid traversing the data multiple times... Just that the name "length cache" should then maybe be changed to reflect the new contents, which would break the plugin API :/
@Emill the plugin API hasn't really been public in the same sense as the rest of Npgsql - I also moved all the necessary types/APIs into an Internal namespace to signal that. Plugins need to access some of the guts of Npgsql (e.g. the type handler type system, the read/write buffers), and we definitely don't want to freeze those APIs. Seen another way, the plugin system was first introduced to allow us to separate optional stuff out (and avoid adding unwanted dependencies to the core) - it's less meant for anyone out there to start coding against... Of course anyone is welcome to do that, but they need to be aware that they don't get the same API stability guarantees etc. So I wouldn't hesitate to change the length cache or anything like that. We've made much bigger breaking changes in the past and nobody's complained. Re JSON specifically, we already cache the serialized JSON string on ConvertedValue, so for top-level values that's already working well. Once again, if we tried to tackle one-pass writing, we could even have System.Text.Json try to serialize directly to binary (with Utf8JsonWriter) - skipping the intermediate string altogether - which could be significant.
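For illustration, here is a stripped-down version of the caching pattern described here (the Parameter type and method names are stand-ins, not Npgsql's actual JsonHandler code): the length pass serializes once and stashes the result on the parameter, and the write pass reuses it instead of serializing again.

```csharp
using System;
using System.Text.Json;

class Parameter
{
    public object? Value { get; set; }
    public object? ConvertedValue { get; set; } // cache shared between the two passes
}

static class CachedJsonWrite
{
    // First pass: compute the length, caching the serialized bytes on the parameter.
    public static int ValidateAndGetLength(Parameter parameter)
    {
        if (parameter.ConvertedValue is not byte[] utf8)
        {
            utf8 = JsonSerializer.SerializeToUtf8Bytes(parameter.Value);
            parameter.ConvertedValue = utf8;
        }
        return utf8.Length;
    }

    // Second pass: write the cached bytes; no re-serialization.
    public static void Write(Parameter parameter, Span<byte> destination)
        => ((byte[])parameter.ConvertedValue!).CopyTo(destination);
}
```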
Would be great to see this addressed in the future, as it is preventing me from using pgsql for my application, which has a high performance requirement. In my case, I have hundreds of thousands of documents that have to be put into a table. All the extra back and forth really hurts perf because of excessive garbage and GC.
@martinmine do you have an indication that this issue specifically is problematic for your scenario? This is about an internal implementation detail of Npgsql, where the parameter list is traversed twice instead of once; previous explorations didn't show this to be significant. This certainly has nothing to do with extra "back and forths" if that's meant in the sense of database round-trips.
@roji Sorry - bad wording from my side :) by back and forth I meant how for example a JSON value is copied multiple times, as @Emill points out earlier in this thread. Nothing to do with round-trips. For my use-case I was investigating excessive GC while running many INSERT statements with a jsonb field, and when I discovered how JSON objects are serialized to a string in JsonHandler.cs before being written to a buffer, I noticed the comments in the source code indicating this is a problem you guys are aware of. Please let me know if this is not the appropriate thread to discuss this and I will create a separate issue for this.
Am assuming you're referring to this comment. /cc @NinoFloris, this is a good example of something that could probably be more efficient (and simpler) with Pipelines - though more investigation would be needed.
Yep - that is correct. I managed to mitigate the memory consumption with a few workarounds and some minor modifications to that class. The first workaround was to set ... The other problem was that my pooled memory stream would still call ...
Am a bit confused about this... JsonHandler.ValidateAndGetLengthCustom already stores the JsonDocument as a byte array on parameter.ConvertedValue, and does not later re-convert it if set. The exception to that is if parameter is null, which occurs when the JsonDocument is nested inside another type (e.g. array of jsonb); but in the normal case of a simple JsonDocument parameter, serialization to byte array should only happen once. Can you be a bit more specific on what optimization you're thinking of here (and maybe submit a quick PR)?
It sounds like what you want is to pass a byte array (or ReadOnlyMemory/ReadOnlySequence) as a parameter value instead of JsonDocument. Generally, allowing users to provide raw binary data for sending is tracked by #4052 (see the earlier #1891 which proposed this specifically for JSON). While I think this is a good idea, it seems to be useful mainly in cases where the same JSON document is being sent multiple times, and so can be serialized to binary once in user code (rather than over and over within Npgsql). Can you confirm that's your scenario?
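For the "same document sent many times" scenario described here, a hedged sketch of what serializing once in user code can look like today, using the existing string-based jsonb parameter path rather than the raw-binary support tracked in #4052 (connection string and table name are placeholders):

```csharp
using System.Text.Json;
using Npgsql;
using NpgsqlTypes;

var doc = new { Name = "example", Tags = new[] { "a", "b" } };
var json = JsonSerializer.Serialize(doc); // serialize once, up front

await using var conn = new NpgsqlConnection("Host=localhost;Database=test");
await conn.OpenAsync();

for (var i = 0; i < 1_000; i++)
{
    await using var cmd = new NpgsqlCommand("INSERT INTO docs (payload) VALUES ($1)", conn);
    // The pre-serialized string is sent as jsonb; the JSON serializer itself runs only once.
    cmd.Parameters.Add(new NpgsqlParameter { Value = json, NpgsqlDbType = NpgsqlDbType.Jsonb });
    await cmd.ExecuteNonQueryAsync();
}
```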
The problem is that I have many large JSON documents I want to process. Because of how the helper method ...
I am not sending the same document multiple times. I am sending hundreds of thousands of documents to the database, and these can be in the 500 kB-1.5 MB range. I want this to be as fast as possible, and the memory consumption was the first thing that caught my eye when I tried using this.
This is somewhat improved after merging #5123, as we'll initialize the stream with the raw text length as a rough capacity estimate. We can improve things a lot further once we add support for SizeKind.Unknown being returned from converters. Once that's in place, returning it from a converter means we'll buffer the entire write, take its length for Postgres, and store the entire buffer on the parameter for writing. This will completely remove the double serialization step. Additionally, we can add support for ROSeq or byte[] for the even more perf-minded crowd, though ideally our one-pass writing should be enough for most (all?) cases.
When writing protocol messages - and especially parameters within the Bind message - Npgsql currently performs two passes: one to validate all parameters and calculate the total byte length of the message, and another for the actual writing.
Length calculation is needed because the PostgreSQL protocol requires the length up-front, while messages may not fit in the write buffer (so we can't simply write first and fill the length in afterwards). Validation happens up-front because if we start writing anything before everything is validated, a validation error would leave us in an unrecoverable state where some protocol messages have already been sent to PostgreSQL, and the connection would have to be broken.
It can be argued that validation errors are quite rare, and detecting them isn't extremely important - having a broken connection in case of a validation error could be acceptable. Also, in the vast majority of cases messages probably do fit in the write buffer.
So, we can implement a fast path where the message is written in one pass, unless it doesn't fit in the buffer; if that happens, we switch back to a two-pass system. The performance impact of this needs to be studied, since the implementation of the two different paths is non-trivial.
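A rough sketch of that fast path, with hypothetical names and plain string parameters standing in for real parameter serialization: write optimistically into the buffer, bail out if the message doesn't fit (the caller then reverts to the two-pass path), and backfill the message length at the end.

```csharp
using System;
using System.Buffers.Binary;
using System.Text;

static class OnePassBind
{
    // Attempts to write all parameters in a single pass: reserve 4 bytes for the
    // message length, write each parameter, then backfill the length. Returns false
    // if the buffer is too small, in which case the caller falls back to the
    // two-pass (measure, then write) path.
    public static bool TryWriteOnePass(Span<byte> buffer, string[] parameters, out int totalLength)
    {
        totalLength = 0;
        if (buffer.Length < sizeof(int))
            return false;

        var offset = sizeof(int); // skip the message length slot for now

        foreach (var p in parameters)
        {
            var needed = sizeof(int) + Encoding.UTF8.GetByteCount(p);
            if (offset + needed > buffer.Length)
                return false; // doesn't fit: caller reverts to the two-pass path

            // Each parameter value is itself length-prefixed, as in the Bind message.
            BinaryPrimitives.WriteInt32BigEndian(buffer[offset..], needed - sizeof(int));
            offset += sizeof(int);
            offset += Encoding.UTF8.GetBytes(p, buffer[offset..]);
        }

        // One-pass trick: go back and backfill the message length we skipped earlier.
        BinaryPrimitives.WriteInt32BigEndian(buffer, offset);
        totalLength = offset;
        return true;
    }
}
```

When TryWriteOnePass returns false, the caller would run the existing two passes instead: validate and sum lengths first, then write, possibly flushing or growing the buffer in between.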