Read arbitrary trees of arrays of composite types, with specific type handlers for inner elements #3558

Closed
Emill opened this issue Feb 21, 2021 · 10 comments · Fixed by #3604 or #3664


Emill commented Feb 21, 2021

For npgsql/efcore.pg#1691, we need a way to efficiently read a column of type record[] which can potentially represent a unique structure for every query. It more or less deals with the same issue as unmapped composite types, but in this case the plain record oid is used.

Currently, a .NET object[][] is returned for PG record[], which is not good enough when some element has a specific type mapping, for example when using EF.

Note that an item in a record can contain another record, or an array of records, nested to arbitrary depth.

There are many approaches to deserialize this. Here are some options:

  1. Some reflection-based approach that takes a Type as input and constructs the given object.
public class MyType {
    public int Thing { get; set; }
    public InnerType[] MoreThings { get; set; }
}

public class InnerType {
    public NpgsqlTimestamp Timestamp { get; set; }
    public int[] IntArray { get; set; }
}
MyType[] records = reader.GetFieldValue<MyType[]>(index);

I believe this was partially supported in Npgsql 4 (though not for inner types), but was dropped.

  2. A communication API for moving the deserializer forward, step by step. Something like this:
// in NpgsqlDataReader:
public void StartDeserializeComplexType(int columnIndex);
public int StartDeserializeArray(); // returns the number of items in the array that follows
public int StartDeserializeRecord(); // returns the number of items in the record that follows
public T DeserializeElement<T>(); // uses the standard type handler for reading the item

The last three methods throw if the deserialization state doesn't match what the user expects.

We can then use it like this:

reader.StartDeserializeComplexType(index);
MyType[] records = new MyType[reader.StartDeserializeArray()];
for (var i = 0; i < records.Length; i++) {
    reader.StartDeserializeRecord();
    MyType record = new MyType();
    record.Thing = reader.DeserializeElement<int>();
    record.MoreThings = new InnerType[reader.StartDeserializeArray()];
    for (var j = 0; j < record.MoreThings.Length; j++) {
        reader.StartDeserializeRecord();
        InnerType inner = new InnerType();
        inner.Timestamp = reader.DeserializeElement<NpgsqlTimestamp>();
        inner.IntArray = reader.DeserializeElement<int[]>(); // can do this the easy way
        record.MoreThings[j] = inner;
    }
    records[i] = record;
}

Or let StartDeserializeComplexType return a new object having the methods above, to avoid cluttering NpgsqlDataReader, at the expense of an extra allocation (which however could be cached...).

  3. Some visitor with a callback API, so the user gets a callback when we enter and leave records, arrays, etc.
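The object-returning variant of option 2 mentioned above could be sketched roughly like this; all names are illustrative, not an existing Npgsql API:

public interface IComplexTypeReader
{
    int StartArray();       // returns the number of items in the array that follows
    int StartRecord();      // returns the number of fields in the record that follows
    T ReadElement<T>();     // reads one element using the standard type handler
}

// in NpgsqlDataReader:
// public IComplexTypeReader StartDeserializeComplexType(int columnIndex);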

While option 1 is probably the easiest to use for normal users, it has performance drawbacks (due to reflection), as well as the potential inability to map record items, which are unnamed, to the correct property (since properties are unordered in C# unless annotated to be laid out in a particular order). Also, if the returned object tree is only for intermediate use, we waste memory.

Option 2 is in my opinion a better fit for how EF materialization works, as it dynamically generates .NET expressions which are compiled to a real function. The materialization code can just generate the (verbose) code that iterates through the record tree and builds whatever entity objects it wants directly, without any boxing or creating a bunch of new .NET types.

Also note that option 1 can easily be implemented as a helper on top of option 2 if we want that.

Comments?


roji commented Feb 21, 2021

Without (yet) discussing the specific methods above...

One general difficulty here is that the record wire representation contains field types but no names; this raises the question of how to match CLR properties (which have no specific ordering) to PG record fields (which come back in a specific order). In other words, imagine a CLR type with two int properties which we want to materialize from a PG record containing two ints - how would we get it right?
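As a concrete (hypothetical) illustration of the problem:

// Hypothetical CLR type. A PG record arrives as, say, ROW(2, 3) with
// field types but no field names, so nothing in the wire representation
// says whether the first value is Width or Height.
public class Dimensions
{
    public int Width { get; set; }
    public int Height { get; set; }
}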

In that sense, composites seem like a better option than records, since their fields have a preset order and Npgsql pre-loads all the composite definitions. So EF can just ask Npgsql to materialize the composite into a given CLR type, and it happens recursively on its own (we already do that). It's true that right now we require the API consumer to map the CLR type to the PG composite in advance, but we could bring unmapped composites back in some way.

One last comment - since the goal here is the EF support, I would leave this part of the problem for later - there are probably going to be quite a few challenges inside EF Core before we ever reach the problem of materialization. Put another way, if we can make everything work with pre-mapped composites (where all the support already exists), that's already a pretty amazing thing, and we can solve that last step when we get there. Just my two cents.


Emill commented Feb 21, 2021

It's true that this is just one of many tasks involved in EF support for the array collection strategy. But anyway, I find it a bit hard to see how to use the pre-mapped composite types with EF, especially since they need to be added to the pg catalog first.

Imagine someone writes a query like this:

var blogs = context.Blogs.Select(b => new {
    b.Title,
    Posts = b.Posts.Select(p => new {
        p.Id,
        p.Text,
        Tags = p.Tags.Select(t => new {
            t.Id,
            t.TagName
        }).ToList()
    }).ToList()
}).ToList();

This will return a result set of rows containing two columns: Title (text) and Posts (record[]). Each record contains Id and Text, followed by another record[] containing records of Id and TagName. We can't require the user to go ahead and create these two composite types (so that they are used instead of the generic record oid) just because they write this particular query in their application. Or do you mean that EF should itself create the needed composite types in the pg catalog for each query?


Emill commented Feb 21, 2021

Also note that there might be other ORMs than EF interested in this, where it is easier to build this kind of support.


roji commented Feb 22, 2021

But anyway, I find it a bit hard to see how to use the pre-mapped composite types with EF, especially since they need to be added to the pg catalog first.

Note that every PostgreSQL table has an implicit composite type which describes its rows; Npgsql doesn't load these "table composites" by default - since there could be a huge number of tables - but you can turn that on with Load Table Composites=true in the connection string. But this indeed only covers projecting out entities, as opposed to anonymous types as in your code sample above. I definitely don't think EF (or the user) should create composite types per query - records do seem like the way to go.
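Turning that on looks roughly like this (host/database/user values are placeholders):

// Connection string sketch; host/database/user values are placeholders.
// With Load Table Composites=true, Npgsql loads the implicit composite
// type of every table, so they can be mapped like explicit composites.
var connString = "Host=localhost;Database=blogdb;Username=app;Load Table Composites=true";
await using var conn = new NpgsqlConnection(connString);
await conn.OpenAsync();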

I guess I'm proposing to defer the Npgsql ADO changes (how to read a record) until we're further along on the EF Core side - that would also give us a better idea of exactly which API from Npgsql would be best to support the EF side.

Also note that there might be other ORMs than EF interested in this, where it is easier to build this kind of support.

That's true, and I think we'll definitely end up building something like this. I just personally don't have the impression that I have the entire picture of what this needs to look like, and at the point where we need to implement the actual EF shaper, things will be much clearer.

But I'm definitely open to doing it your way if you feel strongly about it.


roji commented Feb 22, 2021

OK you got me thinking about it despite what I wrote above :)

Another way to go for exposing things in Npgsql is to allow programs to request a strongly-typed ValueTuple from RecordHandler. This approach would only be usable if the API consumer knows in advance the order of the different fields, which is indeed the case in EF (since it generated the querying SQL).

One possible disadvantage of a strongly-typed ValueTuple is the creation of a lot of CLR generic types because of different parameter type combinations. As an alternative (or possibly in addition?), we could expose something similar to a DbDataReader implementation, which would allow consumers to generically (or non-generically) access a field by index only (but not by name).

Note that all of the above only seems necessary to avoid boxing when reading value fields on records; otherwise the current object[] approach should work fine (that's another reason why we can just go ahead with the EF Core work and not worry about it too much right away). I don't think there's a need to go beyond records and provide a general-purpose visitation/deserialization API; unless I'm mistaken, consumers (such as EF) are again expected to know in advance exactly which fields and types they're reading. In the EF shaper, you'd effectively be generating code that walks over the specific tree structure read from Npgsql, knowing exactly what to expect.
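If RecordHandler learned to produce ValueTuples, usage for the blog/post/tag query above might look something like this (hypothetical; reader and postsColumn are assumed to come from an executing query):

// Hypothetical usage, assuming RecordHandler could produce ValueTuples.
// The field order is known to the caller, since EF generated the SQL.
var posts = reader.GetFieldValue<(int Id, string Text, (int Id, string TagName)[] Tags)[]>(postsColumn);
foreach (var post in posts)
    foreach (var tag in post.Tags)
        Console.WriteLine($"post {post.Id} has tag {tag.TagName}");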

Of course, this whole approach has the disadvantage that Npgsql returns the data in one form (with records), which must then be transformed/materialized into the user's POCO entity types. But any other approach, where Npgsql just gives back the POCOs directly, would require establishing the mapping with composites in advance.

Anyway, just some random thoughts :)

roji added this to the vNext milestone Feb 22, 2021

Emill commented Feb 22, 2021

Thanks for the thinking!

Another way to go for exposing things in Npgsql is to allow programs to request a strongly-typed ValueTuple from RecordHandler. This approach would only be usable if the API consumer knows in advance the order of the different fields, which is indeed the case in EF (since it generated the querying SQL).

I was thinking of something like this too, but was a bit surprised that C# doesn't allow an arbitrary number of generic arguments, as C++ does. Therefore ValueTuple is limited in its number of items. It seems this is worked around, however, by letting the last item consist of another, nested ValueTuple. So yes, this can indeed be a good idea that won't require any new API functions. Note that even though a ValueTuple is a struct (not heap-allocated), we will generally ask for a ValueTuple[], which will be allocated on the heap. So with this approach we will still waste some memory that is thrown away immediately. I have no idea, though, how important this is for performance, if at all. EF will create a bunch of objects anyway, which the user will probably discard quite quickly.
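The nesting is visible in plain C#: a tuple literal with more than seven elements compiles to ValueTuple<T1,...,T7,TRest> with the overflow stored in TRest, while the language still flattens member access:

using System;

class TupleNesting
{
    static void Main()
    {
        // Nine elements: items 8 and 9 are stored in the nested
        // ValueTuple held by the TRest type parameter.
        var wide = (1, 2, 3, 4, 5, 6, 7, 8, 9);
        Console.WriteLine(wide.Item8); // prints 8 (sugar for wide.Rest.Item1)
        Console.WriteLine(wide.GetType().GetGenericArguments().Length); // prints 8 (T1..T7 plus TRest)
    }
}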

One possible disadvantage of a strongly-typed ValueTuple is the creation of a lot of CLR generic types because of different parameter type combinations. As an alternative (or possibly in addition?), we could expose something similar to a DbDataReader implementation, which would allow consumers to generically (or non-generically) access a field by index only (but not by name).

You mean something like recordReader = dbDataReader.GetFieldValue<RecordReader>(index), and then using recordReader.GetFieldValue<int>(3) to access the inner element? Could work, but might require some extra work to identify all the byte offsets of the fields to allow random-access reads. We also need to keep track that the next row has not been fetched yet, etc.

Note that all of the above only seems necessary to avoid boxing when reading value fields on records; otherwise the current object[] approach should work fine (that's another reason why we can just go ahead with the EF Core work and not worry about it too much right away).

Just returning an object[] won't work, since then each element will be decoded into the default CLR type. A user could map a column in EF to a different data type; that's why EF needs to be able to tell Npgsql what type to expect in every field.


roji commented Feb 22, 2021

I was thinking of something like this too, but was a bit surprised that C# doesn't allow arbitrary number of generic arguments, as C++ does. [...]

Yeah, this is a known detail of ValueTuple which shouldn't block us (but will make things slightly more complicated to implement).

Note that even though the ValueTuple is a struct (non-heap allocated), we will generally ask for a ValueTuple[], which will be allocated on the heap. So with this approach we will still waste some memory that is to be thrown away immediately.

That's a good point. Ideally we wouldn't waste memory like this, so a visitor/deserialization API could indeed make sense to eliminate those heap allocations. And if we do that, it may as well expose the right API for record etc.

You mean something like recordReader = dbDataReader.GetFieldValue<RecordReader>(index), and then using recordReader.GetFieldValue<int>(3) to access the inner element? Could work, but might require some extra work to identify all the byte offsets of the fields to allow random-access reads. We also need to keep track that the next row has not been fetched yet, etc.

Yep, that's right (all of it). It's not unlike what we already do in NpgsqlDataReader (note also the importance of CommandBehavior.SequentialAccess!).

Just returning an object[] won't work since then each element will be decoded into the default CLR type. A user could possibly map a column in EF to use a different data type, that's why EF needs to be able to tell Npgsql what type there should be in every field.

That's true, though I'd happily start with a solution that works for the default CLR type :)


Emill commented Mar 17, 2021

OK, so I put up the initial implementation of the inner DbDataReader idea here:
Emill@194187d
So far it only works in non-sequential mode.


roji commented Mar 17, 2021

@Emill can you please submit this as a PR so it's easier to review/comment?


Emill commented Mar 17, 2021

Posted at #3604.
