Document changed expectations in DataFile metadata for 2.1 format #6411

@westonpace

File format 2.1 changed how fields are mapped to columns. In 2.0, every non-leaf field (list or struct) had its own column in the file. In file versions >= 2.1, these fields no longer have columns; their validity information is instead folded into the repetition and definition levels.

For example, given a schema like:

{
  x: i32,
  y: [f32],
  z: {
    a: i32
  }
}

The lance schema will have:

x: field-id 0
y: field-id 1
y.item: field-id 2
z: field-id 3
z.a: field-id 4

The 2.0 data file metadata will be:

fields: 0, 1, 2, 3, 4
column_indices: 0, 1, 2, 3, 4

The 2.1 data file metadata will be (the non-leaf fields y and z have no column, so they map to -1):

fields: 0, 1, 2, 3, 4
column_indices: 0, -1, 1, -1, 2

This should be documented in a migration guide for 5.0.0 (since that is where the default changed to 2.1).

Note: this will not affect most users, since most do not construct DataFile messages directly. However, some advanced users (e.g. those creating operations by hand for Dataset.commit) may be impacted.
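
For anyone who does build this metadata by hand, the mapping in the example above can be sketched in plain Python. This is only an illustration and not the Lance API: `Field`, `flatten`, and `column_indices` are hypothetical helpers, field ids are assumed to follow a depth-first pre-order walk of the schema, and each leaf is assumed to occupy exactly one column (which matches the example but may not hold for every data type or encoding).

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Field:
    """Hypothetical stand-in for a schema field (not a Lance class)."""
    name: str
    children: List["Field"] = field(default_factory=list)


def flatten(fields: List[Field], prefix: str = "") -> List[Tuple[str, Field]]:
    """Walk the schema depth-first, pre-order; position in the result is the assumed field id."""
    out = []
    for f in fields:
        path = prefix + f.name
        out.append((path, f))
        out.extend(flatten(f.children, path + "."))
    return out


def column_indices(fields: List[Field], version: Tuple[int, int]) -> List[int]:
    """Return one column index per field id for the given file format version."""
    flat = flatten(fields)
    if version < (2, 1):
        # 2.0: every field, leaf or not, has its own column.
        return list(range(len(flat)))
    # >= 2.1: non-leaf fields have no column (-1); leaves are numbered consecutively.
    indices = []
    next_column = 0
    for _, f in flat:
        if f.children:
            indices.append(-1)
        else:
            indices.append(next_column)
            next_column += 1
    return indices


# The schema from the example: x: i32, y: [f32], z: {a: i32}
schema = [
    Field("x"),
    Field("y", [Field("item")]),
    Field("z", [Field("a")]),
]

paths = [path for path, _ in flatten(schema)]
print(paths)                           # ['x', 'y', 'y.item', 'z', 'z.a']
print(column_indices(schema, (2, 0)))  # [0, 1, 2, 3, 4]
print(column_indices(schema, (2, 1)))  # [0, -1, 1, -1, 2]
```

Code that previously assumed `column_indices` equals `range(len(fields))` would need a check like the one above when writing files with version >= 2.1.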
