Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH GH20601 raise error when pivot table's number of levels > int32 #20784

Closed
wants to merge 8 commits into from
Closed

ENH GH20601 raise error when pivot table's number of levels > int32 #20784

wants to merge 8 commits into from

Conversation

anhqle
Copy link

@anhqle anhqle commented Apr 22, 2018

Because the error of number of levels exceeding int32 can happen both in pivot_table and unstack, I catch the error as early as possible in both places.

Arguments for and against catching the error in pivot_table:

  • For: pivot_table does aggregation before calling unstack, so it takes a while before the error is raised
  • Against: pivot_table ultimately calls unstack, so it's not worth it to check for the error in both places

I'm happy to go with either way as you prefer.

@pep8speaks
Copy link

pep8speaks commented Apr 22, 2018

Hello @anhqle! Thanks for updating the PR.

Cheers ! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on July 31, 2018 at 17:22 Hours UTC

@codecov
Copy link

codecov bot commented Apr 22, 2018

Codecov Report

Merging #20784 into master will decrease coverage by <.01%.
The diff coverage is 80%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #20784      +/-   ##
==========================================
- Coverage   92.07%   92.06%   -0.01%     
==========================================
  Files         170      170              
  Lines       50688    50693       +5     
==========================================
+ Hits        46671    46672       +1     
- Misses       4017     4021       +4
Flag Coverage Δ
#multiple 90.47% <80%> (-0.01%) ⬇️
#single 42.28% <0%> (-0.03%) ⬇️
Impacted Files Coverage Δ
pandas/core/reshape/pivot.py 97.03% <100%> (ø) ⬆️
pandas/core/reshape/reshape.py 99.57% <75%> (-0.22%) ⬇️
pandas/core/dtypes/common.py 94.87% <0%> (-0.34%) ⬇️
pandas/util/testing.py 85.69% <0%> (-0.21%) ⬇️
pandas/core/arrays/datetimes.py 95.44% <0%> (-0.03%) ⬇️
pandas/core/dtypes/dtypes.py 96.05% <0%> (+0.02%) ⬆️
pandas/core/indexes/period.py 93.5% <0%> (+0.1%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c272c52...7e6246c. Read the comment docs.

@jreback
Copy link
Contributor

jreback commented Apr 22, 2018

generally use the original PR and push to it. having a new one is confusing.

@jreback jreback added Reshaping Concat, Merge/Join, Stack/Unstack, Explode Error Reporting Incorrect or improved errors from pandas Numeric Operations Arithmetic, Comparison, and Logical operations labels Apr 22, 2018
@@ -29,6 +29,11 @@ def pivot_table(data, values=None, index=None, columns=None, aggfunc='mean',
index = _convert_by(index)
columns = _convert_by(columns)

num_rows = data.reindex(index, axis='columns').shape[0]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you are doing extra work here (e.g. the reindex), can we not do the calculation directly?

@@ -29,6 +29,11 @@ def pivot_table(data, values=None, index=None, columns=None, aggfunc='mean',
index = _convert_by(index)
columns = _convert_by(columns)

num_rows = data.reindex(index, axis='columns').shape[0]
num_columns = data.reindex(columns, axis='columns').shape[0]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also this is error is now in 2 places

num_rows = (data.reindex(columns=index).drop_duplicates().shape[0]
if index else 1)
num_cols = (data.reindex(columns=columns).drop_duplicates().shape[0]
if columns else 1)
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The goal here is to get an accurate size of the resulting pivot table, taking into account potential duplicates.

However I recognize that the performance hit may not be worth it given that this is such an edge case. It's up to you whether we should just raise the error within unstack.

@anhqle
Copy link
Author

anhqle commented Apr 22, 2018

My apology about creating a new PR. I pull --rebase upstream master in the original PR and the diff got polluted by other commits. Am I supposed to pull --rebase upstream master, or the maintainers are in charge of the final merge into master?

Thanks again for your guidance -- I'm learning my way through the codebase and hope to be an active contributor in the future.

@jreback
Copy link
Contributor

jreback commented Apr 22, 2018

your u should rebase against upstream every time you push - it makes it so your code is not out of date wrt to master
but you don’t need to squash

@anhqle
Copy link
Author

anhqle commented Jun 9, 2018

Do you think this PR is ready to be merged? Happy to do any additional work.

@jreback
Copy link
Contributor

jreback commented Jul 28, 2018

can you rebase

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add a whatsnew note, bug fixes in reshaping for 0.24.0

@@ -126,6 +126,13 @@ def __init__(self, values, index, level=-1, value_columns=None,
self.removed_level = self.new_index_levels.pop(self.level)
self.removed_level_full = index.levels[self.level]

num_rows = np.max([index_level.size for index_level
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add a comment on the checking here

@jreback jreback added this to the 0.24.0 milestone Jul 31, 2018
@jreback
Copy link
Contributor

jreback commented Sep 25, 2018

can you rebase

@jreback
Copy link
Contributor

jreback commented Nov 11, 2018

closing in favor of #23512

@jreback jreback closed this Nov 11, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Error Reporting Incorrect or improved errors from pandas Numeric Operations Arithmetic, Comparison, and Logical operations Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ERR: pivot_table when number of levels larger than int32 range
3 participants