pandas nested data infeasible operations #17

mroeschke · 2022-09-19T18:51:38Z

pandas generally discourages storing nested data in a DataFrame or Series. I tend to group the nested data that does show up in pandas into two types:

Array-like (N-D)

In [4]: nested_array_like = pd.Series([[1, 2], [2, 3]])

In [5]: nested_array_like
Out[5]:
0    [1, 2]
1    [2, 3]
dtype: object

The only behavior that is somewhat defined and tested for array-like values (Python lists specifically) is addition, which concatenates the lists row by row:

In [14]: nested_array_like + nested_array_like
Out[14]:
0    [1, 2, 1, 2]
1    [2, 3, 2, 3]
dtype: object

There is also explode, which encourages users to unnest their data:

In [15]: nested_array_like.explode()
Out[15]:
0    1
0    2
1    2
1    3
dtype: object
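As a rough sketch of how explode can stand in for per-row operations today, the unnested values can be regrouped on the repeated index. The variable names below are illustrative, and the cast to int64 is an assumption that every element is an integer.

```python
import pandas as pd

nested_array_like = pd.Series([[1, 2], [2, 3]])

# explode() repeats the original index label once per list element
exploded = nested_array_like.explode()

# The exploded values have object dtype, so cast before doing arithmetic
# (assumes all elements are integers)
per_row_sum = exploded.astype("int64").groupby(level=0).sum()

# Grouping on the repeated index and collecting into lists restores the nesting
regrouped = exploded.groupby(level=0).agg(list)
```

This round trip works, but it materializes one row per element, which is part of why it feels awkward for larger nested data.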

Some operations I have seen users try with array-like data are:

groupby the array-like values
element-wise operations (e.g. add 2 to each element in the array)
reduction-wise operations per array-like value (e.g. sum each array)
indexing/selecting/slicing the array-like values
containment operations (e.g. 2 in each array -> True/False)
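For illustration, here is one way users typically approximate the operations listed above with plain Python callables. This is only a sketch of common workarounds, not behavior pandas defines for nested data; each call runs a Python-level loop over the rows, and the variable names are made up for this example.

```python
import pandas as pd

nested_array_like = pd.Series([[1, 2], [2, 3]])

# Element-wise: add 2 to each element of each list
plus_two = nested_array_like.apply(lambda xs: [x + 2 for x in xs])

# Reduction per list: sum each list
sums = nested_array_like.apply(sum)

# Indexing: first element of each list (the str accessor also indexes lists)
firsts = nested_array_like.str[0]

# Containment: is 2 in each list?
has_two = nested_array_like.apply(lambda xs: 2 in xs)

# Groupby on the values: lists are unhashable, so convert them to tuples first
as_tuples = nested_array_like.apply(tuple)
counts = as_tuples.groupby(as_tuples).size()
```

None of these are vectorized, so they tend to be slow on large Series.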

Key-Value-like

In [6]: nested_kv_like = pd.Series([{1:2}, {2:3}])

In [7]: nested_kv_like
Out[7]:
0    {1: 2}
1    {2: 3}
dtype: object

The only behavior supported and tested for dict-like values is dict.get via the str accessor (which is somewhat strange IMO):

In [13]: nested_kv_like.str.get(1)
Out[13]:
0    2.0
1    NaN
dtype: float64

I've seen users try many of the operations described above with key-value-like data too, except that they treat the keys or the values as the "arrays".
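A minimal sketch of what treating the keys or values as "arrays" tends to look like in practice, again using plain Python callables rather than any dedicated pandas API. The variable names are illustrative only.

```python
import pandas as pd

nested_kv_like = pd.Series([{1: 2}, {2: 3}])

# Pull the keys and the values of each dict out as lists
keys = nested_kv_like.apply(lambda d: list(d.keys()))
values = nested_kv_like.apply(lambda d: list(d.values()))

# Reduction over the values of each dict
value_sums = nested_kv_like.apply(lambda d: sum(d.values()))

# Containment check on the keys
has_key_1 = nested_kv_like.apply(lambda d: 1 in d)

# Alternative: unnest into one column per key; missing keys become NaN
unnested = pd.DataFrame(nested_kv_like.tolist(), index=nested_kv_like.index)
```

The DataFrame route only makes sense when the set of keys is small and shared across rows; otherwise the result is mostly NaN.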

martindurant · 2022-09-20T17:49:38Z

These are certainly worthwhile processing models that we will chase. However, I was wondering if you knew of specific datasets or workflows that people were choosing not to process with Python/pandas because it was too awkward or slow?

We are finding that nested/ragged data just doesn't show up much in Python precisely because no one knows what to do with it, even though it is ubiquitous in the real world. We could probably do something interesting with the likes of https://www.kaggle.com/datasets/eliasdabbas/web-server-access-logs , for instance. We have the following specific cases in mind for examples:

Chicago taxis, which have details of the exact ride path as a sequence of lat/lon pairs
million-songs, which has various analyses of the bars of various songs
NYC building outline polygons
scientific telemetry from floating buoys

Any other suggestions?

martindurant · 2022-09-20T17:59:00Z

https://pythonspeed.com/articles/json-memory-streaming/ is a smallish example we can compare against directly; it takes 23 MB in memory with ak, but has a very complicated typestring.

mroeschke · 2022-09-20T18:09:00Z

"However, I was wondering if you knew of specific datasets or workflows that people were choosing not to process with python/pandas because it was too awkward or slow?"

Ah, I see. Sorry, I am not too familiar with public-ish datasets/workflows for this case.

martindurant closed this as completed Oct 31, 2024
