Precasting of series before final dataframe is created #3805

Open
joshuataylor opened this issue Jun 25, 2022 · 5 comments
Labels
enhancement New feature or an improvement of an existing feature

Comments

@joshuataylor
Contributor

Describe your feature request

With Snowflake, we receive Arrow Streaming IPC files, which we can parse.

However, they send us timestamp data in a Struct, which we have to process. This struct contains two fields: an epoch in i64 seconds and a fraction (nanoseconds).
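
For illustration, the column's Arrow type looks roughly like this (a sketch using arrow2 types; the field names and the Int32 fraction width are assumptions on my part):

    use arrow2::datatypes::{DataType, Field};

    // Sketch of the struct type Snowflake sends for timestamps; field
    // names and the Int32 fraction width are assumptions.
    let snowflake_timestamp = DataType::Struct(vec![
        Field::new("epoch", DataType::Int64, true),    // seconds since the Unix epoch
        Field::new("fraction", DataType::Int32, true), // nanoseconds within the second
    ]);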

We also receive other columns in somewhat odd formats (int32, etc.), so we'll need to do this for a couple of other types as well.

I think it's because they adopted Arrow early, before certain types were finalized in Arrow? (Just guessing.)

I would like to have something like this:

IpcStreamReader::new(stream_reader)
    .with_casting(casting_functions)
    .finish()
    .unwrap();

I'm not sure what the best way to pass the casting functions would be, and I know this might add a lot of complexity, so I'd like ideas around this.
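
One possible shape for casting_functions (purely hypothetical; nothing like this exists in Polars today) would be a map from column name to a closure that rewrites the Series before the DataFrame is assembled:

    use std::collections::HashMap;
    use polars::prelude::*;

    // Hypothetical: a boxed closure that rewrites a Series during reading.
    type CastFn = Box<dyn Fn(Series) -> Result<Series, PolarsError> + Send + Sync>;

    let mut casting_functions: HashMap<String, CastFn> = HashMap::new();
    casting_functions.insert(
        "created_at".to_string(), // hypothetical column name
        Box::new(|s| s.cast(&DataType::Datetime(TimeUnit::Nanoseconds, None))),
    );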

@ritchie46
Member

Why not cast after reading? There is no difference in the amount of compute.
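
For example (a minimal sketch; "ts" is a hypothetical column name):

    use polars::prelude::*;

    // Read the full DataFrame first, then cast the columns that need it.
    let mut df = IpcStreamReader::new(reader).finish()?;
    df.apply("ts", |s: &Series| {
        s.cast(&DataType::Datetime(TimeUnit::Nanoseconds, None)).unwrap()
    })?;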

@jorgecarleitao
Collaborator

Curious: is that metadata in a Rust struct (e.g. somewhere in the file's metadata), or in an Arrow StructArray (i.e. one value per row)?

@joshuataylor
Contributor Author

joshuataylor commented Jun 25, 2022

@jorgecarleitao
Collaborator

It should be equally fast. Do note that that has no equivalent among Arrow's native types: an Arrow logical timestamp has a single offset/tz stored in the DataType, shared among all rows, whereas a StructArray is likely storing an offset/tz per row.
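
To illustrate (a sketch using arrow2 types):

    use arrow2::datatypes::{DataType, TimeUnit};

    // An Arrow logical timestamp: the timezone is stored once, in the
    // DataType itself, and shared by every row of the array.
    let logical = DataType::Timestamp(TimeUnit::Nanosecond, Some("UTC".to_string()));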

@joshuataylor
Contributor Author

Is there a recommended way to do so?

The reason I'm asking is that for https://github.com/elixir-nx/explorer we want to remap arbitrary types the user passes in, which we can do via cast (which works great).

The problem with Snowflake IPC streams is, as you mentioned above, that it becomes a lot trickier, so my idea is to (see the sketch after this list):

  1. Get the schema before we finish the DataFrame.
  2. Pass the finished DataFrame & schema into a mapping function.
  3. Create a hashmap of the metadata so we can look it up more easily (the cleanest way I could find to do this).
  4. Use an iterator to loop over the columns and find those which are structs.
    4a. For structs, check that the "logicalType" in that field's metadata (which we get from the schema) is "TIMESTAMP_NTZ" | "TIMESTAMP_LTZ" | "TIMESTAMP_TZ"; we do slight variations for each (with regard to timezones, etc.).
    4b. Snowflake hands you the epoch and fraction, which you can get from the series you are currently iterating over:

        let fields = series.struct_().unwrap().fields();
        let epoch_series = fields.get(0).unwrap();
        let fraction_series = fields.get(1).unwrap();

    4c. Create a new datetime using chrono's NaiveDateTime::from_timestamp(epoch, fraction as u32).
  5. After this, re-add it to the DataFrame, or just make a new one, depending on performance.

Of course, any other types that need custom casting are handled the same way in step 4.
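
Roughly, this is the sketch I have in mind for step 4 (assuming the two struct fields are the epoch seconds and the nanosecond fraction, in that order; the helper name is mine):

    use polars::prelude::*;

    // Hypothetical helper: convert a Snowflake timestamp struct column
    // (epoch seconds + nanosecond fraction) into a Datetime Series.
    fn struct_to_datetime(series: &Series) -> Result<Series, PolarsError> {
        let fields = series.struct_()?.fields();

        // Cast both fields to i64 first, since the fraction may arrive as Int32.
        let epoch = fields[0].cast(&DataType::Int64)?;
        let fraction = fields[1].cast(&DataType::Int64)?;
        let epoch = epoch.i64()?;
        let fraction = fraction.i64()?;

        // Combine into a single i64 nanosecond timestamp, then cast to Datetime.
        let ns: Int64Chunked = epoch
            .into_iter()
            .zip(fraction.into_iter())
            .map(|(e, f)| match (e, f) {
                (Some(e), Some(f)) => Some(e * 1_000_000_000 + f),
                _ => None,
            })
            .collect();

        ns.into_series()
            .cast(&DataType::Datetime(TimeUnit::Nanoseconds, None))
    }

Each converted Series would then replace the struct column in the DataFrame (step 5).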

My testing shows this Snowflake implementation is very fast (thanks to the trifecta of arrow2, Polars, and Rust 👍).

Would it be better to use apply, or is that advised against for changing the type?

We'd also be interested in passing in custom mappings from Elixir as well, but that needs much more thought.

Sorry for all the questions; I'd be happy to move this out of an issue and into a discussion board somewhere. 🙇

@stinodego added the enhancement label and removed the feature label · Jul 14, 2023