File format 2.1 changed how fields are mapped to columns. In 2.0, non-leaf fields (list fields and struct fields) each had their own column in the file. In file versions >= 2.1, these fields no longer have columns in the file. Instead, their validity information is folded into the repetition and definition levels.
As an example, consider a schema like:
{
  x: i32,
  y: [f32],
  z: {
    a: i32
  }
}
The Lance schema will have:
x: field-id 0
y: field-id 1
y.item: field-id 2
z: field-id 3
z.a: field-id 4
The 2.0 data file metadata will be:
fields: 0, 1, 2, 3, 4
column_indices: 0, 1, 2, 3, 4
The 2.1 data file metadata will be (a column index of -1 means the field has no column of its own):
fields: 0, 1, 2, 3, 4
column_indices: 0, -1, 1, -1, 2
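To make the mapping rule concrete, here is a minimal Python sketch that reproduces the two `column_indices` lists above. It is not Lance code: the `Field` class and the `flatten` and `column_indices` helpers are hypothetical, and it assumes field ids are assigned in depth-first pre-order (as in the example) and that columns are numbered in that same order.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Field:
    """Hypothetical stand-in for a schema field; not a Lance class."""
    name: str
    children: List["Field"] = field(default_factory=list)


def flatten(fields: List[Field], prefix: str = "") -> List[Tuple[str, bool]]:
    """Return (path, is_leaf) pairs in depth-first pre-order.

    A pair's position in the result is the field id used in the example.
    """
    out = []
    for f in fields:
        path = prefix + f.name
        out.append((path, not f.children))
        out.extend(flatten(f.children, path + "."))
    return out


def column_indices(fields: List[Field], version: Tuple[int, int]) -> List[int]:
    """Assign one column index per field; -1 means the field has no column."""
    indices = []
    next_column = 0
    for _path, is_leaf in flatten(fields):
        if version < (2, 1) or is_leaf:
            # 2.0: every field gets a column. >= 2.1: only leaf fields do.
            indices.append(next_column)
            next_column += 1
        else:
            indices.append(-1)
    return indices


# The schema from the example: x: i32, y: [f32], z: {a: i32}
schema = [
    Field("x"),
    Field("y", [Field("item")]),
    Field("z", [Field("a")]),
]

print(column_indices(schema, (2, 0)))  # [0, 1, 2, 3, 4]
print(column_indices(schema, (2, 1)))  # [0, -1, 1, -1, 2]
```

Anyone building DataFile messages by hand would need to apply the equivalent rule for the file version they are writing.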
This should be documented in a migration guide for 5.0.0 (since that is where the default changed to 2.1).
Note: this will not affect most users, since most users do not construct DataFile messages directly. However, some advanced users (e.g. those creating operations by hand for Dataset.commit) may be impacted.