Precasting of series before final dataframe is created #3805

Open
joshuataylor opened this issue Jun 25, 2022 · 5 comments
Labels
enhancement New feature or an improvement of an existing feature

Comments

@joshuataylor
Contributor

Describe your feature request

With Snowflake, we receive Arrow Streaming IPC files, which we can parse.

However, they send us timestamp data in a Struct, which we have to process. This struct contains two fields: an epoch in i64 seconds and a fraction (nanoseconds).
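
For illustration, the column's Arrow type looks roughly like this (a sketch using arrow2 types; the field names and the Int32 fraction width are assumptions on my part):

    use arrow2::datatypes::{DataType, Field};

    // Sketch of the struct type Snowflake sends for timestamps; field
    // names and the Int32 fraction width are assumptions.
    let snowflake_timestamp = DataType::Struct(vec![
        Field::new("epoch", DataType::Int64, true),    // seconds since the Unix epoch
        Field::new("fraction", DataType::Int32, true), // nanoseconds within the second
    ]);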

We also receive other columns in somewhat odd formats (int32, etc.), so we'll need to do this for a couple of other types as well.

I think it's because they adopted Arrow early, before certain types were finalized in Arrow? (Just guessing.)

I would like to have something like this:

IpcStreamReader::new(stream_reader)
    .with_casting(casting_functions)
    .finish()
    .unwrap();

I'm not sure what the best way to pass the casting functions would be, and I know this might add a lot of complexity, so I'd like ideas around this.
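
One possible shape for casting_functions (purely hypothetical; nothing like this exists in Polars today) would be a map from column name to a closure that rewrites the Series before the DataFrame is assembled:

    use std::collections::HashMap;
    use polars::prelude::*;

    // Hypothetical: a boxed closure that rewrites a Series during reading.
    type CastFn = Box<dyn Fn(Series) -> Result<Series, PolarsError> + Send + Sync>;

    let mut casting_functions: HashMap<String, CastFn> = HashMap::new();
    casting_functions.insert(
        "created_at".to_string(), // hypothetical column name
        Box::new(|s| s.cast(&DataType::Datetime(TimeUnit::Nanoseconds, None))),
    );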

@ritchie46
Member

Why not cast after reading? There is no difference in the amount of compute.
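
For example (a minimal sketch; "ts" is a hypothetical column name):

    use polars::prelude::*;

    // Read the full DataFrame first, then cast the columns that need it.
    let mut df = IpcStreamReader::new(reader).finish()?;
    df.apply("ts", |s: &Series| {
        s.cast(&DataType::Datetime(TimeUnit::Nanoseconds, None)).unwrap()
    })?;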

@jorgecarleitao
Collaborator

Curious: is that metadata in a Rust struct (e.g. somewhere in the file's metadata), or in an Arrow StructArray (i.e. one value per row)?

@joshuataylor
Contributor Author

joshuataylor commented Jun 25, 2022

@jorgecarleitao
Collaborator

It should be equally fast. Do note that that has no equivalent among Arrow's native types: an Arrow logical timestamp has a single offset/tz stored in the DataType, shared among all rows, whereas a StructArray is likely storing an offset/tz per row.
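
To illustrate (a sketch using arrow2 types):

    use arrow2::datatypes::{DataType, TimeUnit};

    // An Arrow logical timestamp: the timezone is stored once, in the
    // DataType itself, and shared by every row of the array.
    let logical = DataType::Timestamp(TimeUnit::Nanosecond, Some("UTC".to_string()));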

@joshuataylor
Contributor Author

Is there a recommended way to do so?

The reason I'm asking is that for https://github.com/elixir-nx/explorer we want to remap arbitrary types the user passes in, which we can do via cast (which works great).

The problem with Snowflake IPC streams is, as you mentioned above, that it becomes a lot trickier, so my idea is to (see the sketch after this list):

  1. Get the schema before we finish the DataFrame.
  2. Pass the finished DataFrame & schema into a mapping function.
  3. Create a hashmap of the metadata so we can look it up more easily (the cleanest way I could find to do this).
  4. Use an iterator to loop over the columns and find those which are structs.
    4a. For structs, check that the "logicalType" in that field's metadata (which we get from the schema) is "TIMESTAMP_NTZ" | "TIMESTAMP_LTZ" | "TIMESTAMP_TZ"; we do slight variations for each (with regard to timezones, etc.).
    4b. Snowflake hands you the epoch and fraction, which you can get from the series you are currently iterating over:

        let fields = series.struct_().unwrap().fields();
        let epoch_series = fields.get(0).unwrap();
        let fraction_series = fields.get(1).unwrap();

    4c. Create a new datetime using chrono's NaiveDateTime::from_timestamp(epoch, fraction as u32).
  5. After this, re-add it to the DataFrame, or just make a new one, depending on performance.

Of course, any other types that need custom casting are handled the same way in step 4.
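
Roughly, this is the sketch I have in mind for step 4 (assuming the two struct fields are the epoch seconds and the nanosecond fraction, in that order; the helper name is mine):

    use polars::prelude::*;

    // Hypothetical helper: convert a Snowflake timestamp struct column
    // (epoch seconds + nanosecond fraction) into a Datetime Series.
    fn struct_to_datetime(series: &Series) -> Result<Series, PolarsError> {
        let fields = series.struct_()?.fields();

        // Cast both fields to i64 first, since the fraction may arrive as Int32.
        let epoch = fields[0].cast(&DataType::Int64)?;
        let fraction = fields[1].cast(&DataType::Int64)?;
        let epoch = epoch.i64()?;
        let fraction = fraction.i64()?;

        // Combine into a single i64 nanosecond timestamp, then cast to Datetime.
        let ns: Int64Chunked = epoch
            .into_iter()
            .zip(fraction.into_iter())
            .map(|(e, f)| match (e, f) {
                (Some(e), Some(f)) => Some(e * 1_000_000_000 + f),
                _ => None,
            })
            .collect();

        ns.into_series()
            .cast(&DataType::Datetime(TimeUnit::Nanoseconds, None))
    }

Each converted Series would then replace the struct column in the DataFrame (step 5).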

My testing shows this Snowflake implementation is very fast (thanks to the trifecta of arrow2, Polars, and Rust 👍).

Would it be better to use apply, or is that advised against for changing the type?

We'd also be interested in passing in custom mappings from Elixir as well, but that needs much more thought.

Sorry for all the questions; I'd be happy to move this out of an issue and into a discussion board somewhere. 🙇

@stinodego added the enhancement label and removed the feature label · Jul 14, 2023