
.unnest_all() #12353

Open
cmdlineluser opened this issue Nov 9, 2023 · 9 comments
Labels
A-dtype-struct Area: struct data type accepted Ready for implementation enhancement New feature or an improvement of an existing feature

Comments

@cmdlineluser
Contributor

Description

Requests for this functionality (or a subset of it) exist across quite a few issues (and several Stack Overflow questions).

unnest_all has cropped up a few times, so I've just chosen that name as a placeholder.

The basic use case is to allow:

df = pl.DataFrame({
    "x": [{"foo":{"a": 1, "b": 2}}, {"foo":{"a": 3, "b": 4}}],
    "y": [{"bar":{"a": 5, "b": 6}}, {"bar":{"a": 7, "b": 8}}]
})

df.pipe(unnest_all)

# shape: (2, 4)
# ┌─────────┬─────────┬─────────┬─────────┐
# │ x.foo.a ┆ x.foo.b ┆ y.bar.a ┆ y.bar.b │
# │ ---     ┆ ---     ┆ ---     ┆ ---     │
# │ i64     ┆ i64     ┆ i64     ┆ i64     │
# ╞═════════╪═════════╪═════════╪═════════╡
# │ 1       ┆ 2       ┆ 5       ┆ 6       │
# │ 3       ┆ 4       ┆ 7       ┆ 8       │
# └─────────┴─────────┴─────────┴─────────┘

My latest attempt at a Python helper for this is to walk the schema to build the expressions:

import polars as pl


def _unnest_all(schema, separator):
    # Walk the schema recursively, yielding the full path to each leaf field.
    # (A tuple path avoids the mutable-default-argument pitfall of `path=[]`.)
    def _unnest(schema, path=()):
        for name, dtype in schema.items():
            base_type = dtype.base_type()

            if base_type == pl.Struct:
                yield from _unnest(dtype.to_schema(), path + (name,))
            else:
                yield path + (name,), dtype

    for (col, *fields), dtype in _unnest(schema):
        expr = pl.col(col)

        for field in fields:
            expr = expr.struct[field]

        if col == "":
            name = separator.join(fields)
        else:
            name = separator.join([col] + fields)

        yield expr.alias(name)


def unnest_all(df, separator="."):
    return df.select(_unnest_all(df.schema, separator))

However, I think the real benefit of this functionality (and the reason for this issue) is that it allows Polars to be used for interactively exploring nested data.

An interesting example: Polars expressions themselves:

pl.debug = lambda self: pl.select(pl.lit(self.meta.write_json()).str.json_extract().alias("")).pipe(unnest_all)

pl.debug((pl.col("a") + pl.col("b")).over("c", "d"))

# shape: (1, 5)
# ┌────────────────────────────────────────┬───────────────────────────────┬─────────────────────────────────────────┬─────────────────────┬─────────────────────┐
# │ Window.function.BinaryExpr.left.Column ┆ Window.function.BinaryExpr.op ┆ Window.function.BinaryExpr.right.Column ┆ Window.partition_by ┆ Window.options.Over │
# │ ---                                    ┆ ---                           ┆ ---                                     ┆ ---                 ┆ ---                 │
# │ str                                    ┆ str                           ┆ str                                     ┆ list[struct[1]]     ┆ str                 │
# ╞════════════════════════════════════════╪═══════════════════════════════╪═════════════════════════════════════════╪═════════════════════╪═════════════════════╡
# │ a                                      ┆ Plus                          ┆ b                                       ┆ [{"c"}, {"d"}]      ┆ GroupsToRows        │
# └────────────────────────────────────────┴───────────────────────────────┴─────────────────────────────────────────┴─────────────────────┴─────────────────────┘

Using Polars to load "JSON" data in the REPL and interactively explore it with .unnest_all() and .explode() is rather nice.

A proper implementation of this would be super useful.

@cmdlineluser cmdlineluser added the enhancement New feature or an improvement of an existing feature label Nov 9, 2023
@itay747

itay747 commented Dec 2, 2023

A high-performance port of pd.json_normalize would be really useful. No Python library actually does JSON normalization particularly well, including the pandas one, because they all fall short of handling the common case of a str key k whose value v has type Iterable[dict[str, str | float | int]]. In that case, the pandas "normalized" result simply emits v as-is, which falls short of what is possible/expected.

There should be specific logic here that does heuristic inspection on v, to avoid flatten-dict's approach of using integers as keys, which results in crude keys like a.0.id in the flattened dictionary. So this could be something like

  1. Find a .*id.* key in the list[dict[str, Any]] and then
  2. Verify all entries in the list have that key present, with unique and atomic values.
  3. Failing that, default back to flatten-dict's approach.

Perhaps this behaviour could be customizable with a parameter to the eventual pl.normalize_json function that accepts a regex for this, but let's limit it to just that one parameter. I would want to avoid the pandas approach of accepting meta/meta_prefix/record_prefix arguments: it's simply too verbose in practice for deeply nested dictionaries, and it assumes those keys even exist, in the context of a function specifically made for making deeply nested JSON objects manageable.

Oh yeah, and above all: this function should follow the standard source: str | Path | IOBase | bytes signature, in line with the other polars.io.read_* ops.

@Wainberg
Contributor

Wainberg commented Jan 2, 2024

This seems really useful! How about just df.unnest() with no arguments, rather than df.unnest_all()?

@deanm0000 deanm0000 added needs triage Awaiting prioritization by a maintainer A-dtype-struct Area: struct data type labels Jan 22, 2024
@niccolopetti

Just had to use this in my project; I hope to see it merged soon.

@cmdlineluser
Contributor Author

Yeah, I just took the name to use as a placeholder.

DuckDB seems to have a recursive parameter for it.

duckdb.sql("""
from df 
select unnest(x), unnest(y)
""")

# ┌────────────────────────────┬────────────────────────────┐
# │            foo             │            bar             │
# │ struct(a bigint, b bigint) │ struct(a bigint, b bigint) │
# ├────────────────────────────┼────────────────────────────┤
# │ {'a': 1, 'b': 2}           │ {'a': 5, 'b': 6}           │
# │ {'a': 3, 'b': 4}           │ {'a': 7, 'b': 8}           │
# └────────────────────────────┴────────────────────────────┘
duckdb.sql("""
from df 
select unnest(x, recursive := true), unnest(y, recursive := true)
""")

# ┌───────┬───────┬───────┬───────┐
# │   a   │   b   │   a   │   b   │
# │ int64 │ int64 │ int64 │ int64 │
# ├───────┼───────┼───────┼───────┤
# │     1 │     2 │     5 │     6 │
# │     3 │     4 │     7 │     8 │
# └───────┴───────┴───────┴───────┘

(although it doesn't seem possible to keep the "path")

@Wainberg
Contributor

Worth noting that DuckDB's recursive is actually a mix of Polars' unnest and Polars' list.explode:

-- unnesting a list of lists recursively, generating 5 rows (1, 2, 3, 4, 5)
SELECT unnest([[1, 2, 3], [4, 5]], recursive := true);
-- unnesting a list of structs recursively, generating two rows of two columns (a, b)
SELECT unnest([{'a': 42, 'b': 84}, {'a': 100, 'b': NULL}], recursive := true);
-- unnesting a struct, generating two columns (a, b)
SELECT unnest({'a': [1, 2, 3], 'b': 88}, recursive := true);

It would be nice to unnest a single level with plain unnest() with no arguments. An optional extension would be to allow some kind of recursive unnesting with recursive=True. But those are two different things.

@fzyzcjy

fzyzcjy commented Apr 3, 2024

Looking forward to seeing it merged!

P.S. Spent a minute slightly modifying it to avoid some linter warnings and add type info; pasted here in case someone needs it:

import polars as pl


def unnest_all(df: pl.DataFrame, separator=".") -> pl.DataFrame:
    """https://github.com/pola-rs/polars/issues/12353"""
    return df.select(_unnest_all(df.schema, separator))


def _unnest_all(schema, separator):
    for (col, *fields), dtype in _unnest(schema, []):
        expr = pl.col(col)

        for field in fields:
            expr = expr.struct[field]

        if col == "":
            name = separator.join(fields)
        else:
            name = separator.join([col] + fields)

        yield expr.alias(name)


def _unnest(schema, path):
    for name, dtype in schema.items():
        base_type = dtype.base_type()

        if base_type == pl.Struct:
            yield from _unnest(dtype.to_schema(), path + [name])
        else:
            yield path + [name], dtype

@mallport

mallport commented Apr 3, 2024

Thanks @cmdlineluser and @fzyzcjy for sharing, that code snippet was useful to me! Very slick.

The inverse operation of "unnest_all" would also be very useful - re-nesting normalized columns based on a separator.

@stinodego stinodego removed the needs triage Awaiting prioritization by a maintainer label May 31, 2024
@MarcoGorelli
Collaborator

MarcoGorelli commented May 31, 2024

from discussion:

  1. there should be a selector for struct dtype
  2. unnest() (without arguments) should probably just use that (rather than adding unnest_all)

@daviewales
Copy link

It would be nice to be able to recursively unnest both lists and structs with an automatic prefix based on the column name. I've commented some functions to do this in another issue: #7078 (comment).

Projects
Status: Ready
Development

No branches or pull requests

10 participants