Support construction of a DataFrame from a Mapping #58814

Closed
mrkn wants to merge 10 commits

@mrkn mrkn commented May 23, 2024

This pull request adds support for constructing a DataFrame from a Mapping. With this change, the following Julia code works:

julia> using PythonCall
julia> pd = pyimport("pandas")
julia> df = pd.DataFrame(Dict("a" => [1, 2, 3], "b" => [4, 5, 6]))
Python:
   b  a
0  4  1
1  5  2
2  6  3
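
For readers who want the distinction spelled out, here is a minimal Python sketch of an object that satisfies `collections.abc.Mapping` without being a `dict`. The class name `FrozenColumns` is hypothetical and only illustrative; it is not part of pandas or PythonCall, and how PythonCall actually wraps a Julia `Dict` is not shown in this PR. The last line shows a conversion to a plain `dict`, which is one way to hand such an object to `pd.DataFrame` regardless of whether this change is applied.

```python
from collections.abc import Mapping


class FrozenColumns(Mapping):
    """Hypothetical read-only mapping; deliberately not a dict subclass."""

    def __init__(self, data):
        self._data = dict(data)

    def __getitem__(self, key):
        return self._data[key]

    def __iter__(self):
        return iter(self._data)

    def __len__(self):
        return len(self._data)


columns = FrozenColumns({"a": [1, 2, 3], "b": [4, 5, 6]})

is_dict = isinstance(columns, dict)        # not a dict instance
is_mapping = isinstance(columns, Mapping)  # but it is a Mapping

# Copy into a plain dict; this keeps the mapping's iteration order.
as_dict = dict(columns)
```

A check such as `isinstance(data, dict)` would skip `columns`, while `isinstance(data, Mapping)` would accept it, which is the generalization this PR proposes.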

I have confirmed that there is no performance degradation when constructing a DataFrame from a dict.

$ asv continuous -E virtualenv -b ^frame_ctor.FromDict origin/main HEAD
· Creating environments
· Discovering benchmarks
·· Uninstalling from virtualenv-py3.10-Cython3.0-jinja2-matplotlib-meson-meson-python-numba-numexpr-odfpy-openpyxl-pyarrow-python-build-scipy-sqlalchemy-tables-xlrd-xlsxwriter
·· Installing 684a22fb <support_mapping_in_dataframe> into virtualenv-py3.10-Cython3.0-jinja2-matplotlib-meson-meson-python-numba-numexpr-odfpy-openpyxl-pyarrow-python-build-scipy-sqlalchemy-tables-xlrd-xlsxwriter..
· Running 16 total benchmarks (2 commits * 1 environments * 8 benchmarks)
[ 0.00%] · For pandas commit 2aa155ae <main> (round 1/2):
[ 0.00%] ·· Building for virtualenv-py3.10-Cython3.0-jinja2-matplotlib-meson-meson-python-numba-numexpr-odfpy-openpyxl-pyarrow-python-build-scipy-sqlalchemy-tables-xlrd-xlsxwriter..
[ 0.00%] ·· Benchmarking virtualenv-py3.10-Cython3.0-jinja2-matplotlib-meson-meson-python-numba-numexpr-odfpy-openpyxl-pyarrow-python-build-scipy-sqlalchemy-tables-xlrd-xlsxwriter
[ 3.12%] ··· Running (frame_ctor.FromDicts.time_dict_of_categoricals--)........
[25.00%] · For pandas commit 684a22fb <support_mapping_in_dataframe> (round 1/2):
[25.00%] ·· Building for virtualenv-py3.10-Cython3.0-jinja2-matplotlib-meson-meson-python-numba-numexpr-odfpy-openpyxl-pyarrow-python-build-scipy-sqlalchemy-tables-xlrd-xlsxwriter..
[25.00%] ·· Benchmarking virtualenv-py3.10-Cython3.0-jinja2-matplotlib-meson-meson-python-numba-numexpr-odfpy-openpyxl-pyarrow-python-build-scipy-sqlalchemy-tables-xlrd-xlsxwriter
[28.12%] ··· Running (frame_ctor.FromDicts.time_dict_of_categoricals--)........
[50.00%] · For pandas commit 684a22fb <support_mapping_in_dataframe> (round 2/2):
[50.00%] ·· Benchmarking virtualenv-py3.10-Cython3.0-jinja2-matplotlib-meson-meson-python-numba-numexpr-odfpy-openpyxl-pyarrow-python-build-scipy-sqlalchemy-tables-xlrd-xlsxwriter
[53.12%] ··· frame_ctor.FromDicts.time_dict_of_categoricals                                327±8μs
[56.25%] ··· frame_ctor.FromDicts.time_list_of_dict                                    16.8±0.09ms
[59.38%] ··· frame_ctor.FromDicts.time_nested_dict                                      16.1±0.2ms
[62.50%] ··· frame_ctor.FromDicts.time_nested_dict_columns                              16.3±0.3ms
[65.62%] ··· frame_ctor.FromDicts.time_nested_dict_index                                13.2±0.1ms
[68.75%] ··· frame_ctor.FromDicts.time_nested_dict_index_columns                       12.9±0.08ms
[71.88%] ··· frame_ctor.FromDicts.time_nested_dict_int64                                29.1±0.1ms
[75.00%] ··· frame_ctor.FromDictwithTimestamp.time_dict_with_timestamp_offsets                  ok
[75.00%] ··· ======== ============
              offset
             -------- ------------
              <Nano>   7.97±0.1ms
              <Hour>   10.9±0.2ms
             ======== ============

[75.00%] · For pandas commit 2aa155ae <main> (round 2/2):
[75.00%] ·· Building for virtualenv-py3.10-Cython3.0-jinja2-matplotlib-meson-meson-python-numba-numexpr-odfpy-openpyxl-pyarrow-python-build-scipy-sqlalchemy-tables-xlrd-xlsxwriter..
[75.00%] ·· Benchmarking virtualenv-py3.10-Cython3.0-jinja2-matplotlib-meson-meson-python-numba-numexpr-odfpy-openpyxl-pyarrow-python-build-scipy-sqlalchemy-tables-xlrd-xlsxwriter
[78.12%] ··· frame_ctor.FromDicts.time_dict_of_categoricals                                330±2μs
[81.25%] ··· frame_ctor.FromDicts.time_list_of_dict                                     17.2±0.3ms
[84.38%] ··· frame_ctor.FromDicts.time_nested_dict                                     16.2±0.08ms
[87.50%] ··· frame_ctor.FromDicts.time_nested_dict_columns                              16.2±0.1ms
[90.62%] ··· frame_ctor.FromDicts.time_nested_dict_index                                13.2±0.2ms
[93.75%] ··· frame_ctor.FromDicts.time_nested_dict_index_columns                        13.5±0.2ms
[96.88%] ··· frame_ctor.FromDicts.time_nested_dict_int64                                29.7±0.3ms
[100.00%] ··· frame_ctor.FromDictwithTimestamp.time_dict_with_timestamp_offsets                  ok
[100.00%] ··· ======== =============
               offset
              -------- -------------
               <Nano>   8.16±0.07ms
               <Hour>   11.2±0.06ms
              ======== =============


BENCHMARKS NOT SIGNIFICANTLY CHANGED.
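
As a rough cross-check of that summary line, the sketch below recomputes the HEAD/main ratio for each benchmark from the round 2/2 mean timings printed above (the ± spreads are ignored). The 1.10 threshold is an assumption chosen for illustration, not something stated in the log, and the short names are abbreviations of the `frame_ctor` benchmark names.

```python
# (name, main, HEAD) mean timings copied from the asv output above.
# Each pair shares a unit (μs for the first row, ms for the rest),
# so the ratio is unitless.
timings = [
    ("time_dict_of_categoricals", 330, 327),
    ("time_list_of_dict", 17.2, 16.8),
    ("time_nested_dict", 16.2, 16.1),
    ("time_nested_dict_columns", 16.2, 16.3),
    ("time_nested_dict_index", 13.2, 13.2),
    ("time_nested_dict_index_columns", 13.5, 12.9),
    ("time_nested_dict_int64", 29.7, 29.1),
    ("time_dict_with_timestamp_offsets[Nano]", 8.16, 7.97),
    ("time_dict_with_timestamp_offsets[Hour]", 11.2, 10.9),
]

THRESHOLD = 1.10  # assumed cut-off for calling a slowdown significant

# Ratio above 1 means HEAD is slower than main.
ratios = {name: head / main for name, main, head in timings}
regressions = [name for name, ratio in ratios.items() if ratio > THRESHOLD]

for name, ratio in ratios.items():
    print(f"{name:45s} {ratio:.3f}")
print("regressions:", regressions)
```

Every ratio falls within a few percent of 1, which is consistent with the reported result.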

Member

@twoertwein twoertwein left a comment

Looks good to me! Can you please update the documentation of DataFrame.__init__ (around line 517) to mention Mapping instead of dict?

It might be good to check whether DataFrame.from_dict can also accept a mapping.

@mrkn
Author

mrkn commented May 26, 2024

@twoertwein I noticed that creating a DataFrame from a Mapping of Mappings wasn't covered, so I added support for it. Could you please review this new change?
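
To illustrate what a Mapping of Mappings looks like on the Python side, here is a small sketch using `types.MappingProxyType`, a standard-library Mapping that is not a `dict`. The helper `to_plain_dict` is hypothetical and is not the code in this PR; it only shows how nested Mappings could be normalized to plain dicts before being passed to the constructor.

```python
from collections.abc import Mapping
from types import MappingProxyType


def to_plain_dict(obj):
    """Recursively copy Mapping instances into plain dicts (hypothetical helper)."""
    if isinstance(obj, Mapping):
        return {key: to_plain_dict(value) for key, value in obj.items()}
    return obj


# Outer and inner containers are Mappings, but none of them is a dict.
nested = MappingProxyType(
    {
        "a": MappingProxyType({"x": 1, "y": 2}),
        "b": MappingProxyType({"x": 3, "y": 4}),
    }
)

plain = to_plain_dict(nested)
```

Non-Mapping values such as lists or scalars pass through unchanged, so the helper leaves ordinary column data alone.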

Member

@twoertwein twoertwein left a comment

Looks good to me!

(DataFrame.from_dict is also documented to allow array-like types that contain Mappings, but that can be a separate PR; see pandas-dev/pandas-stubs#928 and pandas-dev/pandas-stubs#929.)

@mrkn mrkn force-pushed the support_mapping_in_dataframe branch from b8273d0 to 375379d Compare May 27, 2024 04:00
Member

@phofl phofl left a comment

Please wait until the discussion on the issue reaches consensus.

Contributor

github-actions bot commented Jul 1, 2024

This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.

@github-actions github-actions bot added the Stale label Jul 1, 2024
@mrkn mrkn force-pushed the support_mapping_in_dataframe branch from 375379d to 88cb50f Compare July 4, 2024 15:54
@mroeschke
Member

Thanks for the PR, but let's centralize the discussion in the issue about whether pandas should make this more general. Closing for now, but we can reopen if the issue reaches consensus to make this change.

@mroeschke mroeschke closed this Jul 22, 2024