Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Large schema mismatches in RecordBatch to StructArray conversion are hard to debug #187

Open
johanpel opened this issue Jun 11, 2024 · 1 comment
Labels
enhancement New feature or request good first issue Good for newcomers help wanted Extra attention is needed question Further information is requested

Comments

@johanpel
Copy link
Collaborator

johanpel commented Jun 11, 2024

When turning an Arrow RecordBatch into a Narrow StructArray, if there is a schema mismatch, there is currently a panic, e.g.:

PrimitiveArray expected data type Int64 got Int32

I believe this message originates from Arrow, not Narrow.

With structs that have many (nested) fields, it is difficult to figure out which field exactly is wrongly typed, since this message only contains the type and not the field name.

It would be great if the panic message would contain the field name.

Perhaps this is not trivial to implement, so alternatively it would be useful to be able to call something like fn schema() -> arrow_schema::Schema on the Narrow StructArray such that users could perform run-time schema validation before trying to convert. I haven't done a thorough search, but from a quick glance, I don't see anything that would specifically report schema mismatches in a friendlier way in arrow_schema, yet I can imagine adding such functionality. This could then be leveraged with the proposed schema function (and prevent panics).

@mbrobbel
Copy link
Owner

These panics happen when downcasting the Array trait objects. Ideally narrow would add support for TryFrom implementations that validate the schema.

For now you can do something like this:

use arrow_array::RecordBatch;
use narrow::{array::StructArray, arrow::StructArrayTypeFields, buffer::VecBuffer, ArrayType};
use pretty_assertions::assert_eq;

#[derive(ArrayType, Default, Clone, Debug, PartialEq, Eq)]
struct Foo {
    a: bool,
    b: u32,
    c: Option<String>,
}

#[derive(ArrayType, Default, Clone, Debug, PartialEq, Eq)]
struct Bar {
    a: bool,
    b: u64,
    c: Option<bool>,
}

fn main() {
    let array = [Bar {
        a: false,
        b: 1,
        c: None,
    }]
    .into_iter()
    .collect::<StructArray<Bar>>();
    let record_batch = RecordBatch::from(array);

    assert_eq!(
        record_batch.schema().fields(),
        &<FooArray<VecBuffer> as StructArrayTypeFields>::fields()
    );
}

Which results in:

assertion failed: `(left == right)`

Diff < left / right > :
 [
     Field {
         name: "a",
         data_type: Boolean,
         nullable: false,
         dict_id: 0,
         dict_is_ordered: false,
         metadata: {},
     },
     Field {
         name: "b",
<        data_type: UInt64,
>        data_type: UInt32,
         nullable: false,
         dict_id: 0,
         dict_is_ordered: false,
         metadata: {},
     },
     Field {
         name: "c",
<        data_type: Boolean,
>        data_type: Utf8,
         nullable: true,
         dict_id: 0,
         dict_is_ordered: false,
         metadata: {},
     },
 ]

@mbrobbel mbrobbel added enhancement New feature or request help wanted Extra attention is needed good first issue Good for newcomers question Further information is requested labels Jun 11, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request good first issue Good for newcomers help wanted Extra attention is needed question Further information is requested
Projects
None yet
Development

No branches or pull requests

2 participants