Support nested datatypes in from_repr #15842

Open
wolliq opened this issue Apr 23, 2024 · 6 comments
Labels
A-other (Area: not covered by other areas), accepted (Ready for implementation), enhancement (New feature or an improvement of an existing feature)

Comments

wolliq commented Apr 23, 2024

Description

In many ML/NLP use cases it would be useful for from_repr to support list types. For example, unit tests often need to rebuild data read from a feature store that holds numerical representations such as embedding vectors.
Today, if we run

import polars as pl

dfp = pl.from_repr("""
shape: (1, 1)
┌──────────────────────────────────────────────────┐
│ segment_ids                                      │
│ ---                                              │
│ list[i32]                                        │
╞══════════════════════════════════════════════════╡
│ [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0] │
└──────────────────────────────────────────────────┘
""")

we get

...
raise NotImplementedError(msg)
NotImplementedError: `from_repr` does not support data type 'List'

Thanks

@wolliq wolliq added the enhancement New feature or an improvement of an existing feature label Apr 23, 2024
@stinodego stinodego changed the title from "NotImplementedError: from_repr does not support data type 'List'" to "Support nested datatypes in from_repr" Apr 24, 2024
@stinodego stinodego added accepted Ready for implementation A-other Area: not covered by other areas labels Apr 24, 2024
@stinodego
Member

Thanks for the issue. This would definitely be good to support.

@tharunsuresh-code
Contributor

tharunsuresh-code commented May 14, 2024

Hey, can I take this up? I assume I would need to support just polars.datatypes.FLOAT_DTYPES and polars.datatypes.INTEGER_DTYPES inside the List, right?

I have opened a draft pull request and would appreciate any comments :) If you think I am heading in the right direction, I can work on test cases and the other functionality associated with this feature.

@stinodego
Member

> Hey, can I take this up? I assume I would need to support just polars.datatypes.FLOAT_DTYPES and polars.datatypes.INTEGER_DTYPES inside the List right?

Sure! Lists can contain anything though (also strings, decimals, ...). So it's not just constrained to floats/integers.
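
To make the point about inner types concrete, here is a minimal sketch of how the text of a list cell might be converted back into Python values once the inner dtype has been read from the header (for example list[i32], list[str], or list[date]). The names parse_list_cell and INNER_PARSERS are hypothetical and exist only for illustration; this is not the Polars implementation. It assumes strings are shown in double quotes and missing values as null, and it does not handle nested lists or elements that contain commas.

```python
import re
from datetime import date

# Hypothetical mapping from the inner dtype name shown in the header
# to a function that converts one element's text into a Python value.
INNER_PARSERS = {
    "i32": int,
    "i64": int,
    "f64": float,
    "str": lambda s: s.strip('"'),
    "date": date.fromisoformat,
}

# Matches a header dtype such as "list[i32]" and captures the inner dtype.
LIST_DTYPE = re.compile(r"^list\[(.+)\]$")


def parse_list_cell(dtype: str, cell: str) -> list:
    """Convert e.g. ("list[i32]", "[0, 1]") into [0, 1]."""
    match = LIST_DTYPE.match(dtype.strip())
    if match is None:
        raise ValueError(f"not a list dtype: {dtype!r}")
    inner = match.group(1)
    if inner not in INNER_PARSERS:
        raise NotImplementedError(f"unsupported inner dtype: {inner!r}")
    convert = INNER_PARSERS[inner]

    body = cell.strip()
    if not (body.startswith("[") and body.endswith("]")):
        raise ValueError(f"not a list literal: {cell!r}")
    body = body[1:-1].strip()
    if not body:
        return []

    # Naive split: assumes no element contains a comma.
    items = [item.strip() for item in body.split(",")]
    return [None if item == "null" else convert(item) for item in items]


print(parse_list_cell("list[i32]", "[0, 0, 1]"))
print(parse_list_cell("list[str]", '["a", "b"]'))
print(parse_list_cell("list[date]", "[2022-07-05, 2023-02-05]"))
```

A table keyed by inner dtype keeps the list handling separate from the per-element conversion, so supporting another inner type only means adding one entry.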

@tharunsuresh-code
Contributor

tharunsuresh-code commented May 16, 2024

Got it, I'm working on it. I have a question about how long values are handled in the string representation of a Polars DataFrame. The column data is cut off as follows:

shape: (2, 3)
┌─────────────────────────────────┬─────────────────────────────────┬─────────────────────────────────┐
│ f                               ┆ g                               ┆ h                               │
│ ---                             ┆ ---                             ┆ ---                             │
│ list[date]                      ┆ list[time]                      ┆ list[datetime[ns]]              │
╞═════════════════════════════════╪═════════════════════════════════╪═════════════════════════════════╡
│ [2022-07-05, 2023-02-05, 2023-… ┆ [00:00:00.000001, 12:30:45, 23… ┆ [2022-07-05 10:30:45.004560, 2… │
│ [2022-07-05, 2023-02-05, 2023-… ┆ [00:00:00.000001, 12:30:45, 23… ┆ [2022-07-05 10:30:45.004560, 2… │
└─────────────────────────────────┴─────────────────────────────────┴─────────────────────────────────┘

Because of this, the data is truncated. Any suggestions on how I can handle this?

@alexander-beedie
Collaborator

alexander-beedie commented May 16, 2024

> Due to this, the data is truncated, any suggestion on how I can handle this?

The reasonable thing to do is load only the whole/valid data; truncated columns (when a frame has more cols than can be displayed) are similarly dropped. There is, after all, no way (at all) to reconstruct the truncated values, so...
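
One possible reading of "load only the whole/valid data" for a truncated list cell is to keep the elements that are fully visible and discard the fragment next to the ellipsis. The sketch below shows that reading with plain string handling. The helper names is_truncated and complete_items are hypothetical, this is not the Polars implementation, and it assumes elements contain no commas. Because a fragment before the ellipsis cannot be told apart from a complete element, it is always dropped.

```python
ELLIPSIS = "…"  # character the table repr uses to mark cut-off text


def is_truncated(cell: str) -> bool:
    """Return True if the cell text ends with the ellipsis marker."""
    return cell.rstrip().endswith(ELLIPSIS)


def complete_items(cell: str) -> list:
    """Return the element strings of a list cell that are fully visible."""
    text = cell.strip()
    if is_truncated(text):
        # Remove the ellipsis and the opening bracket, then drop the last
        # fragment, since it may have been cut off mid-value.
        items = text[:-1].lstrip("[").split(",")[:-1]
    else:
        items = text.strip("[]").split(",")
    return [item.strip() for item in items if item.strip()]


print(complete_items("[2022-07-05, 2023-02-05, 2023-…"))
print(complete_items("[2022-07-05 10:30:45.004560, 2…"))
print(complete_items("[1, 2, 3]"))
```

Applied to the frame shown earlier in the thread, this would keep two dates, two times, and one datetime per row. Dropping the whole cell or column instead would be an equally defensible choice.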

@tharunsuresh-code
Contributor

tharunsuresh-code commented May 17, 2024

Got it, thanks! I have opened a pull request. Could you please review it and let me know if you have any suggestions?
