PERF: DataFrame constructor from list dataclasses #44306

Closed
3 tasks done
ezerkar opened this issue Nov 3, 2021 · 4 comments
Labels
Constructors (Series/DataFrame/Index/pd.array Constructors), Performance (Memory or execution speed performance)

Comments

ezerkar commented Nov 3, 2021

  • I have checked that this issue has not already been reported.

  • I have confirmed this issue exists on the latest version of pandas.

  • I have confirmed this issue exists on the master branch of pandas.

Reproducible Example

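import random
from dataclasses import dataclass

import pandas as pd


@dataclass
class Example:
    first: int
    second: int

# 1000 instances with random integer fields
class_list = [Example(random.randint(0, 1000), random.randint(0, 1000)) for _ in range(1000)]

%timeit pd.DataFrame(class_list)
6.1 ms ± 902 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)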

This is probably because the constructor uses dataclasses.asdict, which is quite slow. I think we can make the constructor work without asdict, with something along these lines:

%timeit pd.DataFrame([(x.first, x.second) for x in class_list], columns=["first", "second"])
653 µs ± 58.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
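
To see how much of the gap comes from the conversion step itself, the two strategies can be timed without pandas. This is a minimal stdlib-only sketch, not the pandas implementation: the helper names `rows_via_asdict` and `rows_via_fields` are hypothetical, and it assumes a flat dataclass with no nested fields, so both helpers produce the same rows.

```python
import random
import timeit
from dataclasses import asdict, dataclass, fields


@dataclass
class Example:
    first: int
    second: int


def rows_via_asdict(items):
    # Recursive conversion: asdict walks and copies every field value.
    return [tuple(asdict(item).values()) for item in items]


def rows_via_fields(items):
    # Shallow conversion: read each declared field directly (hypothetical helper).
    names = [f.name for f in fields(items[0])]
    return [tuple(getattr(item, name) for name in names) for item in items]


class_list = [Example(random.randint(0, 1000), random.randint(0, 1000)) for _ in range(1000)]
columns = [f.name for f in fields(Example)]

# For a flat dataclass both strategies yield identical rows.
same_rows = rows_via_asdict(class_list) == rows_via_fields(class_list)

# Timings vary by machine, so they are printed rather than asserted.
t_asdict = timeit.timeit(lambda: rows_via_asdict(class_list), number=20)
t_fields = timeit.timeit(lambda: rows_via_fields(class_list), number=20)
print(f"asdict: {t_asdict:.4f}s  fields/getattr: {t_fields:.4f}s")
```

The rows and `columns` built this way can be passed to `pd.DataFrame(rows, columns=columns)`, mirroring the workaround above without hard-coding the field names.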

Installed Versions

INSTALLED VERSIONS

commit : 945c9ed
python : 3.8.5.final.0
python-bits : 64
OS : Linux
OS-release : 5.11.0-38-generic
Version : #42~20.04.1-Ubuntu SMP Tue Sep 28 20:41:07 UTC 2021
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_IL
LOCALE : en_IL.UTF-8

pandas : 1.3.4
numpy : 1.19.2
pytz : 2020.1
dateutil : 2.8.1
pip : 20.2.4
setuptools : 50.3.1.post20201107
Cython : 0.29.21
pytest : 6.1.1
hypothesis : None
sphinx : 3.2.1
blosc : None
feather : None
xlsxwriter : 1.3.7
lxml.etree : 4.6.1
html5lib : 1.1
pymysql : None
psycopg2 : 2.9.1 (dt dec pq3 ext lo64)
jinja2 : 2.11.2
IPython : 7.19.0
pandas_datareader: None
bs4 : 4.9.3
bottleneck : 1.3.2
fsspec : 0.8.3
fastparquet : None
gcsfs : None
matplotlib : 3.3.2
numexpr : 2.7.1
odfpy : None
openpyxl : 3.0.5
pandas_gbq : 0.14.1
pyarrow : 5.0.0
pyxlsb : None
s3fs : None
scipy : 1.5.2
sqlalchemy : 1.3.20
tables : 3.6.1
tabulate : 0.8.9
xarray : None
xlrd : 1.2.0
xlwt : 1.3.0
numba : 0.51.2

Prior Performance

No response

ezerkar added the Needs Triage and Performance labels Nov 3, 2021
mroeschke added the Constructors label and removed the Needs Triage label Nov 6, 2021

phofl (Member) commented Dec 22, 2021

Unfortunately, it's not that simple. asdict resolves the attributes recursively. Also, your example would not cover different types of dataclasses.
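
Both points can be illustrated with the standard library alone. In this sketch, `Point`, `Segment`, and `Label` are hypothetical dataclasses, and the mixed list is one reading of "different types of dataclasses": `asdict` turns nested `Point` fields into plain dicts, and a hard-coded attribute tuple fails as soon as the list contains a dataclass with different fields.

```python
from dataclasses import asdict, dataclass


@dataclass
class Point:
    x: int
    y: int


@dataclass
class Segment:
    start: Point
    end: Point


@dataclass
class Label:
    text: str


# asdict recurses: nested dataclass instances become nested dicts.
nested = asdict(Segment(Point(0, 0), Point(1, 2)))
print(nested)

# A list mixing dataclass types still converts item by item with asdict...
mixed = [Point(1, 2), Label("origin")]
as_dicts = [asdict(item) for item in mixed]
print(as_dicts)

# ...but hard-coded attribute access breaks on the second item.
try:
    rows = [(item.x, item.y) for item in mixed]
except AttributeError as exc:
    rows = None
    print(f"attribute-tuple approach failed: {exc}")
```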

ezerkar (Author) commented Dec 23, 2021

Yes, you are right; I hadn't realised that when I first posted the suggestion.
That said, I'm not sure losing the recursion is entirely bad, as right now this constructor behaves more like json_normalize than like a plain constructor.
For instance, say one of the fields in the dataclass is itself a dataclass: the current asdict-based constructor will expand it into columns, while the user might want a single column holding the dataclass instance.
But this is a much wider discussion.
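
What happens to a nested field at the conversion step can be checked without pandas. A minimal sketch, assuming hypothetical `Inner` and `Outer` dataclasses: `asdict` converts the nested instance into a dict, whereas a shallow dict comprehension over `fields()` (the workaround described in the Python `dataclasses` documentation) leaves it as a single `Inner` object. How the DataFrame constructor then lays out either result is not shown here.

```python
from dataclasses import asdict, dataclass, fields


@dataclass
class Inner:
    a: int
    b: int


@dataclass
class Outer:
    name: str
    inner: Inner


obj = Outer("row0", Inner(1, 2))

# Recursive: the nested Inner instance is replaced by a plain dict.
recursive = asdict(obj)

# Shallow: each field value is kept as-is, so "inner" is still the Inner instance.
shallow = {f.name: getattr(obj, f.name) for f in fields(obj)}

print(recursive)
print(shallow)
```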

phofl (Member) commented Dec 23, 2021

Yep, you are correct; this would be an API change.

Also, I personally don't like DataFrames with nested data, so I would prefer that my dataclass gets resolved.

ezerkar (Author) commented Dec 24, 2021

OK, thanks.
I see your point; it makes sense.
Closing.

ezerkar closed this as completed Dec 24, 2021