
.unnest_all() #12353

Open
cmdlineluser opened this issue Nov 9, 2023 · 9 comments
Labels
A-dtype-struct Area: struct data type accepted Ready for implementation enhancement New feature or an improvement of an existing feature

Comments

@cmdlineluser
Contributor

Description

Requests for this functionality (or a subset of it) exist across quite a few issues (and several Stack Overflow questions).

unnest_all has cropped up a few times, so I've just chosen that name as a placeholder.

The basic use case is to allow:

df = pl.DataFrame({
    "x": [{"foo":{"a": 1, "b": 2}}, {"foo":{"a": 3, "b": 4}}],
    "y": [{"bar":{"a": 5, "b": 6}}, {"bar":{"a": 7, "b": 8}}]
})

df.pipe(unnest_all)

# shape: (2, 4)
# ┌─────────┬─────────┬─────────┬─────────┐
# │ x.foo.a ┆ x.foo.b ┆ y.bar.a ┆ y.bar.b │
# │ ---     ┆ ---     ┆ ---     ┆ ---     │
# │ i64     ┆ i64     ┆ i64     ┆ i64     │
# ╞═════════╪═════════╪═════════╪═════════╡
# │ 1       ┆ 2       ┆ 5       ┆ 6       │
# │ 3       ┆ 4       ┆ 7       ┆ 8       │
# └─────────┴─────────┴─────────┴─────────┘

My latest attempt at a Python helper for this is to walk the schema to build the expressions:

import polars as pl


def _unnest_all(schema, separator):
    # Walk the schema recursively, yielding the full path to each leaf field.
    # (A tuple path avoids the mutable-default-argument pitfall of `path=[]`.)
    def _unnest(schema, path=()):
        for name, dtype in schema.items():
            base_type = dtype.base_type()

            if base_type == pl.Struct:
                yield from _unnest(dtype.to_schema(), path + (name,))
            else:
                yield path + (name,), dtype

    for (col, *fields), dtype in _unnest(schema):
        expr = pl.col(col)

        for field in fields:
            expr = expr.struct[field]

        if col == "":
            name = separator.join(fields)
        else:
            name = separator.join([col] + fields)

        yield expr.alias(name)


def unnest_all(df, separator="."):
    return df.select(_unnest_all(df.schema, separator))

However, I think the real benefit of this functionality (and the reason for this issue) is that it allows Polars to be used for interactively exploring nested data.

An interesting example: Polars expressions themselves:

pl.debug = lambda self: pl.select(pl.lit(self.meta.write_json()).str.json_extract().alias("")).pipe(unnest_all)

pl.debug((pl.col("a") + pl.col("b")).over("c", "d"))

# shape: (1, 5)
# ┌────────────────────────────────────────┬───────────────────────────────┬─────────────────────────────────────────┬─────────────────────┬─────────────────────┐
# │ Window.function.BinaryExpr.left.Column ┆ Window.function.BinaryExpr.op ┆ Window.function.BinaryExpr.right.Column ┆ Window.partition_by ┆ Window.options.Over │
# │ ---                                    ┆ ---                           ┆ ---                                     ┆ ---                 ┆ ---                 │
# │ str                                    ┆ str                           ┆ str                                     ┆ list[struct[1]]     ┆ str                 │
# ╞════════════════════════════════════════╪═══════════════════════════════╪═════════════════════════════════════════╪═════════════════════╪═════════════════════╡
# │ a                                      ┆ Plus                          ┆ b                                       ┆ [{"c"}, {"d"}]      ┆ GroupsToRows        │
# └────────────────────────────────────────┴───────────────────────────────┴─────────────────────────────────────────┴─────────────────────┴─────────────────────┘

Using Polars to load "JSON" data in the REPL and interactively explore it with .unnest_all() and .explode() is rather nice.

A proper implementation of this would be super useful.

@cmdlineluser cmdlineluser added the enhancement New feature or an improvement of an existing feature label Nov 9, 2023
@itay747

itay747 commented Dec 2, 2023

A high-performance port of pd.json_normalize would be really useful. No Python library actually does JSON normalization particularly well, including the pandas one, because they all fall short of handling the common case of a str key k whose value v has type Iterable[dict[str, str | float | int]]. In that case, the pandas "normalized" result simply emits v as-is, which falls short of what is possible/expected.

There should be specific logic here that does heuristic inspection on v, to avoid flatten-dict's approach of using integers as keys, which results in crude keys like a.0.id in the flattened dictionary. So this could be something like

  1. Find a .*id.* key in the list[dict[str, Any]] and then
  2. Verify all entries in the list have that key present, with unique and atomic values.
  3. Failing that, default back to flatten-dict's approach.

Perhaps this behaviour could be customizable with a parameter to the eventual pl.normalize_json function that accepts a regex for this, but let's limit it to just that one parameter. I would want to avoid the pandas approach of accepting meta/meta_prefix/record_prefix arguments: it's simply too verbose in practice for deeply nested dictionaries, and it assumes those keys even exist, in the context of a function specifically made for making deeply nested JSON objects manageable.

Oh yeah, and above all: this function should follow the standard source: str | Path | IOBase | bytes signature, in line with the other polars.io.read_* ops.

@Wainberg
Contributor

Wainberg commented Jan 2, 2024

This seems really useful! How about just df.unnest() with no arguments, rather than df.unnest_all()?

@deanm0000 deanm0000 added needs triage Awaiting prioritization by a maintainer A-dtype-struct Area: struct data type labels Jan 22, 2024
@niccolopetti

Just had to use this in my project; I hope to see it merged soon.

@cmdlineluser
Contributor Author

Yeah, I just took the name to use as a placeholder.

DuckDB seems to have a recursive parameter for it.

duckdb.sql("""
from df 
select unnest(x), unnest(y)
""")

# ┌────────────────────────────┬────────────────────────────┐
# │            foo             │            bar             │
# │ struct(a bigint, b bigint) │ struct(a bigint, b bigint) │
# ├────────────────────────────┼────────────────────────────┤
# │ {'a': 1, 'b': 2}           │ {'a': 5, 'b': 6}           │
# │ {'a': 3, 'b': 4}           │ {'a': 7, 'b': 8}           │
# └────────────────────────────┴────────────────────────────┘
duckdb.sql("""
from df 
select unnest(x, recursive := true), unnest(y, recursive := true)
""")

# ┌───────┬───────┬───────┬───────┐
# │   a   │   b   │   a   │   b   │
# │ int64 │ int64 │ int64 │ int64 │
# ├───────┼───────┼───────┼───────┤
# │     1 │     2 │     5 │     6 │
# │     3 │     4 │     7 │     8 │
# └───────┴───────┴───────┴───────┘

(although it doesn't seem possible to keep the "path")

@Wainberg
Contributor

Worth noting that DuckDB's recursive is actually a mix of Polars' unnest and Polars' list.explode:

-- unnesting a list of lists recursively, generating 5 rows (1, 2, 3, 4, 5)
SELECT unnest([[1, 2, 3], [4, 5]], recursive := true);
-- unnesting a list of structs recursively, generating two rows of two columns (a, b)
SELECT unnest([{'a': 42, 'b': 84}, {'a': 100, 'b': NULL}], recursive := true);
-- unnesting a struct, generating two columns (a, b)
SELECT unnest({'a': [1, 2, 3], 'b': 88}, recursive := true);

It would be nice to unnest a single level with plain unnest() with no arguments. An optional extension would be to allow some kind of recursive unnesting with recursive=True. But those are two different things.

@fzyzcjy

fzyzcjy commented Apr 3, 2024

Looking forward to seeing it merged!

P.S. Spent a minute slightly modifying it to avoid some linter warnings and add type info; pasted here in case someone needs it:

import polars as pl


def unnest_all(df: pl.DataFrame, separator=".") -> pl.DataFrame:
    """https://github.com/pola-rs/polars/issues/12353"""
    return df.select(_unnest_all(df.schema, separator))


def _unnest_all(schema, separator):
    for (col, *fields), dtype in _unnest(schema, []):
        expr = pl.col(col)

        for field in fields:
            expr = expr.struct[field]

        if col == "":
            name = separator.join(fields)
        else:
            name = separator.join([col] + fields)

        yield expr.alias(name)


def _unnest(schema, path):
    for name, dtype in schema.items():
        base_type = dtype.base_type()

        if base_type == pl.Struct:
            yield from _unnest(dtype.to_schema(), path + [name])
        else:
            yield path + [name], dtype

@mallport

mallport commented Apr 3, 2024

Thanks @cmdlineluser and @fzyzcjy for sharing, that code snippet was useful to me! Very slick.

The inverse operation of "unnest_all" would also be very useful - re-nesting normalized columns based on a separator.

@stinodego stinodego removed the needs triage Awaiting prioritization by a maintainer label May 31, 2024
@MarcoGorelli
Collaborator

MarcoGorelli commented May 31, 2024

from discussion:

  1. there should be a selector for struct dtype
  2. unnest() (without arguments) should probably just use that (rather than adding unnest_all)

@daviewales
Copy link

It would be nice to be able to recursively unnest both lists and structs with an automatic prefix based on the column name. I've commented some functions to do this in another issue: #7078 (comment).

Projects
Status: Ready
Development

No branches or pull requests

10 participants