pandas nested data infeasible operations #17

Closed
mroeschke opened this issue Sep 19, 2022 · 3 comments

Comments

@mroeschke
Collaborator

pandas generally discourages storing nested data in a DataFrame or Series. I tend to group the kinds of nested data I see in pandas as:

  1. Array-like (N-D)
In [4]: nested_array_like = pd.Series([[1, 2], [2, 3]])

In [5]: nested_array_like
Out[5]:
0    [1, 2]
1    [2, 3]
dtype: object

The only behavior that is somewhat defined and tested for array-like values (Python lists specifically) is addition, which concatenates the lists row by row:

In [14]: nested_array_like + nested_array_like
Out[14]:
0    [1, 2, 1, 2]
1    [2, 3, 2, 3]
dtype: object

There is also explode, which encourages users to unnest their data:

In [15]: nested_array_like.explode()
Out[15]:
0    1
0    2
1    2
1    3
dtype: object

Some operations I have seen users try with array-like data are:

  • groupby the array-like values
  • element-wise operations (e.g. add 2 to each element in the array)
  • reduction-wise operations per array-like value (e.g. sum each array)
  • indexing/selecting/slicing the array-like values
  • containment operations (e.g. 2 in each array -> True/False)
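
As a rough illustration of the operations listed above, here is a minimal sketch of how users typically approximate them today with plain pandas and Python callables. None of this is a dedicated nested-data API; the variable names are made up for the example, and every step falls back to per-row Python code on an object-dtype Series.

```python
import pandas as pd

nested_array_like = pd.Series([[1, 2], [2, 3]])

# Element-wise: add 2 to each element of each list.
plus_two = nested_array_like.map(lambda xs: [x + 2 for x in xs])

# Reduction per list: sum each list.
sums = nested_array_like.map(sum)

# Indexing: take the first element of each list.
firsts = nested_array_like.map(lambda xs: xs[0])

# Containment: is 2 in each list?
has_two = nested_array_like.map(lambda xs: 2 in xs)

# Groupby: lists are unhashable, so convert them to tuples first.
group_sizes = nested_array_like.groupby(nested_array_like.map(tuple)).size()

# Alternative reduction: unnest with explode, then aggregate on the
# original index. The cast is needed because explode returns object dtype.
exploded_sums = (
    nested_array_like.explode().astype("int64").groupby(level=0).sum()
)
```

Each of these works, but only by looping over Python objects row by row, which is part of why such operations feel awkward or slow in pandas.
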
  2. Key-Value-like
In [6]: nested_kv_like = pd.Series([{1:2}, {2:3}])

In [7]: nested_kv_like
Out[7]:
0    {1: 2}
1    {2: 3}
dtype: object

The only behavior supported and tested for dict-like values is dict.get via the str accessor (which is somewhat strange IMO):

In [13]: nested_kv_like.str.get(1)
Out[13]:
0    2.0
1    NaN
dtype: float64

I've seen users try many of the same operations described above with key-value-like data, except that they treat the keys or the values specifically as "arrays".
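
A minimal sketch of what treating the keys or values as "arrays" can look like with plain pandas, assuming the same nested_kv_like Series as above. The variable names are illustrative only, and as with lists, each step runs a Python callable per row rather than a vectorized operation.

```python
import pandas as pd

nested_kv_like = pd.Series([{1: 2}, {2: 3}])

# Pull the keys and the values out as per-row lists.
keys = nested_kv_like.map(lambda d: list(d.keys()))
values = nested_kv_like.map(lambda d: list(d.values()))

# Reduction over the values of each dict.
value_sums = nested_kv_like.map(lambda d: sum(d.values()))

# Containment check on the keys of each dict.
has_key_1 = nested_kv_like.map(lambda d: 1 in d)

# Unnest into one column per key; missing keys become NaN.
wide = pd.DataFrame(nested_kv_like.tolist(), index=nested_kv_like.index)
```

The last step is the dict analogue of explode: once the keys are columns, the usual column-wise pandas operations apply.
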

@martindurant
Member

These are certainly worthwhile processing models that we will chase. However, I was wondering if you knew of specific datasets or workflows that people were choosing not to process with python/pandas because it was too awkward or slow?

We are finding that nested/ragged data just doesn't show up a lot in Python, precisely because no one knows what to do with it, even though it is ubiquitous in the real world. We could probably do something interesting with the likes of https://www.kaggle.com/datasets/eliasdabbas/web-server-access-logs , for instance. We have the following specific cases in mind for examples:

  • chicago taxis, which have details of the exact ride path as a sequence of lat/lon pairs
  • million-songs, which has various analyses of the bars of various songs
  • NYC building outline polygons
  • scientific telemetry from floating buoys

Any other suggestions?

@martindurant
Member

https://pythonspeed.com/articles/json-memory-streaming/ is a smallish example we can directly compare against; it takes 23MB for ak in memory, but has a very complicated typestring.

@mroeschke
Collaborator Author

However, I was wondering if you knew of specific datasets or workflows that people were choosing not to process with python/pandas because it was too awkward or slow?

Ah, I see. Sorry, I am not too familiar with public-ish datasets/workflows for this case.
