Skip to content

Autocast List Columns? #268

@dougbrn

Description

@dougbrn

Feature request
Have been playing with this idea for the last day or so, where essentially; if we think that nested columns are a more supported interface than lists, why not just cast them directly?

The main case I'm thinking of is on read, with the option to turn it off via a kwarg (cast_lists). The obvious issue being that if there are multiple lists that are part of the same structure, our casting would produce multiple separate nested columns, but it may be easier from an API perspective for users to combine those nested columns into a single column (potentially using a new function called combine_nested) rather than using the nest_lists interface, e.g.

import nested_pandas as npd
nf = npd.read_parquet("input.parquet") # has list-columns; flux, mjd, band

# nf has 3 nested structures; flux (with sub-column flux), mjd (with subcolumn mjd), and band (with subcolumn band)

# create a single nested structure
nf = nf.combine_nested(["flux", "mjd","band"], name="lc")
# now nf has 1 nested structure; lc (with sub-columns flux, mjd, and band)

The strength of this is that it's more clear to users that flux, mjd and band are related because the nested representation will provide the dimensionality of the lists, so you can more easily see what fields are equal length. With lists, you need to either go off name context or check the lengths yourself.

Another strength is that these columns are usable right away in the API for filtering, counting, reduce, etc. You could even avoid the combination step if you just wanted to use the columns for reduce right after loading.

The downsides I see are that there's some overhead to casting this, and generating the offsets with potential duplication of offsets in cases like above.

Before submitting
Please check the following:

  • I have described the purpose of the suggested change, specifying what I need the enhancement to accomplish, i.e. what problem it solves.
  • I have included any relevant links, screenshots, environment information, and data relevant to implementing the requested feature, as well as pseudocode for how I want to access the new functionality.
  • If I have ideas for how the new feature could be implemented, I have provided explanations and/or pseudocode and/or task lists for the steps.

Metadata

Metadata

Assignees

Labels

enhancementNew feature or requestquestionFurther information is requested

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions