pandas nested data infeasible operations #17
Comments
These are certainly worthwhile processing models that we will chase. However, I was wondering if you knew of specific datasets or workflows that people were choosing not to process with Python/pandas because it was too awkward or slow? We are finding that nested/ragged data just doesn't show up much in Python, precisely because no one knows what to do with it, even though it is ubiquitous in the real world. We could probably do something interesting with the likes of https://www.kaggle.com/datasets/eliasdabbas/web-server-access-logs, for instance. We have the following specific cases in mind for examples:
Any other suggestions?
https://pythonspeed.com/articles/json-memory-streaming/ is a smallish example we can directly compare to; it takes 23MB for ak in memory, but has a very complicated typestring.
Ah I see. Sorry, I am not too familiar with public-ish datasets/workflows for this case.
pandas generally discourages having nested data in a `DataFrame` or `Series`. For nested data in pandas, I tend to group the types of nested data as:

The only behavior somewhat defined and tested for array-like (Python `list` specifically) is addition, which acts like an `append`.
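
For illustration, here is a minimal sketch of the list-addition behavior described above. It assumes an object-dtype `Series` holding plain Python lists; the variable names and values are made up for this example and are not taken from the pandas test suite.

```python
import pandas as pd

# Two object-dtype Series whose elements are ragged Python lists
# (hypothetical example data)
left = pd.Series([[1, 2], [3]], dtype=object)
right = pd.Series([[10], [20, 30]], dtype=object)

# "+" is applied element by element, so each row ends up as
# Python list concatenation: the right list is appended to the left one
combined = left + right

print(combined.tolist())
```

Each row is concatenated independently, so the inner lists do not need to share a length.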
And there is `explode`, which encourages users to `unnest` their data.

Some operations I have seen users try with array-like data are:
- `groupby` the array-like values
- `sum` each array

The only behavior supported and tested for dict-like is `dict.get` via the `str` accessor (which is somewhat strange IMO).

A lot of the same operations described above I've seen users try with key-value-like data, except specifically treating the `keys` or `values` as "arrays".
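
As a rough sketch of the operations mentioned in this issue, the snippet below unnests a list column with `explode`, sums each original list by grouping the unnested rows, and looks up a dict key through the `str` accessor. The frame, its column names (`key`, `nums`, `attrs`), and the values are hypothetical, and summing via `explode` plus `groupby` is just one possible workaround, not a recommended or official pattern.

```python
import pandas as pd

# Hypothetical frame with one list-valued and one dict-valued column
df = pd.DataFrame(
    {
        "key": ["a", "b"],
        "nums": [[1, 2, 3], [4, 5]],
        "attrs": [{"x": 1}, {"y": 2}],
    }
)

# explode gives every list element its own row and repeats the
# original index label for each element
flat = df[["key", "nums"]].explode("nums")

# The exploded column is object dtype, so cast it before reducing
flat["nums"] = flat["nums"].astype("int64")

# "sum each array", expressed as a groupby over the unnested rows
row_sums = flat.groupby("key")["nums"].sum()

# dict.get through the str accessor; a row whose dict lacks the
# key comes back as a missing value
looked_up = df["attrs"].str.get("x")

print(flat)
print(row_sums)
print(looked_up)
```

Treating the dict `keys` or `values` as "arrays" would still need a Python-level step such as `df["attrs"].map(lambda d: list(d.values()))` before any of the list operations above could be applied.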