Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DEPR: join_axes-kwarg in pd.concat #22318

Merged
merged 6 commits into from Jul 3, 2019

Conversation

h-vetinari
Copy link
Contributor

@pep8speaks
Copy link

pep8speaks commented Aug 13, 2018

Hello @h-vetinari! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2019-07-03 02:21:51 UTC

@codecov
Copy link

codecov bot commented Aug 13, 2018

Codecov Report

Merging #22318 into master will increase coverage by 50.26%.
The diff coverage is 100%.

Impacted file tree graph

@@             Coverage Diff             @@
##           master   #22318       +/-   ##
===========================================
+ Coverage   41.96%   92.23%   +50.26%     
===========================================
  Files         180      161       -19     
  Lines       50860    51333      +473     
===========================================
+ Hits        21345    47348    +26003     
+ Misses      29515     3985    -25530
Flag Coverage Δ
#multiple 90.62% <100%> (?)
#single 42.29% <34.48%> (+0.32%) ⬆️
Impacted Files Coverage Δ
pandas/core/groupby/generic.py 87.1% <100%> (+72.32%) ⬆️
pandas/core/frame.py 97.03% <100%> (+62.07%) ⬆️
pandas/core/reshape/concat.py 97.66% <100%> (+53.66%) ⬆️
pandas/core/generic.py 96.81% <100%> (+58.18%) ⬆️
pandas/io/gbq.py 25% <0%> (-58.34%) ⬇️
pandas/compat/__init__.py 58.36% <0%> (-33.64%) ⬇️
pandas/io/clipboard/clipboards.py 28.23% <0%> (-6.55%) ⬇️
pandas/plotting/_misc.py 38.98% <0%> (-4.26%) ⬇️
pandas/io/formats/console.py 75.75% <0%> (-2.37%) ⬇️
pandas/_libs/__init__.py 100% <0%> (ø) ⬆️
... and 174 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update b618eb6...154ab0e. Read the comment docs.

pandas/core/reshape/concat.py Outdated Show resolved Hide resolved
@jreback jreback added Reshaping Concat, Merge/Join, Stack/Unstack, Explode Deprecate Functionality to remove in pandas labels Aug 14, 2018
@@ -6614,11 +6614,11 @@ def _join_compat(self, other, on=None, how='left', lsuffix='', rsuffix='',
if can_concat:
if how == 'left':
how = 'outer'
join_axes = [self.index]
return concat(frames, axis=1, join=how,
verify_integrity=True).reindex(self.index)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can add copy=False to the .reindex

@@ -8848,7 +8848,7 @@ def describe_1d(data):
if name not in names:
names.append(name)

d = pd.concat(ldesc, join_axes=pd.Index([names]), axis=1)
d = pd.concat([x.reindex(names) for x in ldesc], axis=1, sort=False)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add copy=False

@@ -518,8 +518,10 @@ def _transform_general(self, func, *args, **kwargs):
applied.append(res)

concat_index = obj.columns if self.axis == 0 else obj.index
concatenated = concat(applied, join_axes=[concat_index],
axis=self.axis, verify_integrity=False)
other_axis = (self.axis + 1) % 2 # switches from 0 to 1 or from 1 to 0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

don't use this, just use an if else

pandas/core/groupby/generic.py Show resolved Hide resolved
pandas/core/reshape/concat.py Outdated Show resolved Hide resolved
pandas/core/reshape/concat.py Outdated Show resolved Hide resolved
pandas/tests/reshape/test_concat.py Show resolved Hide resolved
Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

small comment, lgtm. ping on green.

if ndim == 2:
other_axis = 1 if axis == 0 else 0 # switches between 0 & 1
res = res.reindex(join_axes[0], axis=other_axis)
else: # 3 for Panel; Panel4D already deprecated
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no need to mention Panel4D.

use elif ndim == 3 to be explicit

@jreback jreback added this to the 0.24.0 milestone Aug 23, 2018
@jreback
Copy link
Contributor

jreback commented Aug 23, 2018

@jorisvandenbossche you had some comments on the issue. I am all for making things simpler.

@h-vetinari
Copy link
Contributor Author

h-vetinari commented Aug 23, 2018

@jorisvandenbossche
I'm answering your comment from #21951 here, as this is further ahead:

"but this will be fixed by 21160 anyway." -> that is long off to be the default integer type, so I don't think we should use that as an argument now

Fair enough, that was a misunderstanding on my part. But the dtype-question is a side-issue here, IMO

"with reindex and reindex_like, it is redundant as well" -> I don't know the implementation, but I would assume that reindexing after the fact can be less performant? (assuming that with join_axes it reindexes each input before concatenating)

I posed this question in the OP ("Only question is if performance would be much worse, if concatenating huge Series/DFs before selecting small index-subset."), and it's a valid point. The user could apply a reindex directly to the arguments of concat (I originally had something to that effect in the FutureWarning in this PR), but I see that it would be good to keep having a way to do this easily.

It is certainly true the way it is explained and spelled (eg the fact that you need to pass a list) is certainly outdated now Panel is removed. But we could also consider improving it.

In my opinion, the join_axes-keyword (as a name, not necessarily as functionality) should definitely be deprecated. The question of what to replace it with is very similar to the discussion in #21855 (comment).

There might be a good solution for replacing the join_axes-functionality with something like the following (conceptually):

concat(objs, axis=0, join='outer', index=None, columns=None, ignore_index=False,
       keys=None, levels=None, names=None, verify_integrity=False, sort=None, copy=True)
[...]
if (axis == 0 and index is not None) or (axis == 1 and columns is not None):
   raise ValueError('Cannot set index of concatenation axis; '
                    'can only set index of non-concatenation axis')
index = join if index is None else index
columns = join if columns is None else columns
[...]
objs = [x.reindex(index/columns, axis=other_axis) for x in objs]
[...]
# actual concatenation

Then index the non-concatenation index of the result could be set as desired (essentially replacing join_axes with a cleaner API, now that Panel will be deprecated).

(note that I am not married to the keyword, I have never used it myself, but just think we should have a bit more discussion about it. It would be interesting to search for usage on github/SO)

There's not much on SO - https://www.google.com/search?q=stackoverflow+pandas+concat+join_axes yields only one (sorta) relevant hit: https://stackoverflow.com/q/27391081

Neither did I find much of substance here: https://github.com/search?p=2&q=join_axes&type=Issues

@h-vetinari
Copy link
Contributor Author

@jreback

small comment, lgtm. ping on green.

All green (rebased due to persistent test failures).

@h-vetinari
Copy link
Contributor Author

@jreback

Green and all feedback incorporated. Anything missing?

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

some comments. ping on green.

@@ -8930,7 +8930,8 @@ def describe_1d(data):
if name not in names:
names.append(name)

d = pd.concat(ldesc, join_axes=pd.Index([names]), axis=1)
d = pd.concat([x.reindex(names) for x in ldesc], axis=1,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use the copy=False on . the .reindex. remove copy= from the pd.concat it is not respected misleading here.

concatenated = concat(applied, join_axes=[concat_index],
axis=self.axis, verify_integrity=False)
other_axis = 1 if self.axis == 0 else 0 # switches between 0 & 1
concatenated = concat(applied, axis=self.axis,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove copy= from the concat

pandas/core/frame.py Show resolved Hide resolved
pandas/core/reshape/concat.py Outdated Show resolved Hide resolved
"length {length}".format(length=ndim - 1))
if ndim == 2:
other_axis = 1 if axis == 0 else 0 # switches between 0 & 1
res = res.reindex(join_axes[0], axis=other_axis)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add copy=False to the .reindex

@h-vetinari
Copy link
Contributor Author

@jorisvandenbossche

I answered above (#22318 (comment)) to your remarks/questions in the issue. Any further comments, API- or otherwise?

@h-vetinari
Copy link
Contributor Author

@jreback

some comments. ping on green.

ping on green

@jreback
Copy link
Contributor

jreback commented Sep 5, 2018

@jorisvandenbossche i am ok with this
any objections

@jorisvandenbossche
Copy link
Member

@h-vetinari sorry for my slow follow-up, didn't have much time the last weeks

I also saw that the discussion here (about an alternative way to do it) is related to the joining discussions in other PRs/issues (you linked #21855). I will try to take a look at those discussions one of the coming days.

I personally think we should first decide on an alternative (or decide upon if we want one) before actually deprecating this, but no strong opinion.

BTW, if we keep the PR as is, the documentation certainly needs an update (there are example of join_axes in the tutorial docs).

@h-vetinari
Copy link
Contributor Author

h-vetinari commented Sep 13, 2018

@jorisvandenbossche

I will try to take a look at those discussions one of the coming days.

Any luck with that?

As a potential shortcut to reading that other discussion, you could have a look at the pseudo-implementation I wrote above in #22318 (comment).

Barring a decision on what to replace the API with, which tutorial docs do you mean? I didn't find anything in https://pandas.pydata.org/pandas-docs/stable/tutorials.html, but there's a bit in doc/source/merging.rst.

@jreback
Copy link
Contributor

jreback commented Sep 23, 2018

looks ok to me, @jorisvandenbossche has some questions.

@h-vetinari
Copy link
Contributor Author

@jreback @jorisvandenbossche

Barring a decision on what to replace the API with, which tutorial docs do you mean? I didn't find anything in https://pandas.pydata.org/pandas-docs/stable/tutorials.html, but there's a bit in doc/source/merging.rst.

@jreback jreback removed this from the 0.24.0 milestone Nov 6, 2018
@h-vetinari
Copy link
Contributor Author

@jorisvandenbossche
Can we pick this up again?

@h-vetinari
Copy link
Contributor Author

@gfyoung

@h-vetinari : FYI, instead of pushing commits to trigger entirely new builds, just ping one of us. We can restart individual builds for you.

Could you retrigger that timed-out travis job please? ;-)

@h-vetinari
Copy link
Contributor Author

jorisvandenbossche: Hadn't seen your comment about before I posted that, sorry. Anyway, that should not matter for codecov. But eg if not all CI builds pass, that can also give failures on codecov I think.

All the CI builds had passed, except for codecov (I imagine because it had forgotten the coverage of the commit where this PR had originally branched off)

separation of concerns.

Fair enough...

@jreback
Copy link
Contributor

jreback commented Jun 28, 2019

@h-vetinari

in regards to: #22318 (comment)

this is a separate and distinct change. let's treat it as such. Keeping PRs nice and simple is the way to go.

@h-vetinari
Copy link
Contributor Author

Fair point, but without someone doing a PR, we'll have just a deprecation, without the replacement. I think it's equally fair to say that that replacement is part of the deprecation itself (to uphold existing usability where necessary). I'll be gone the next two weeks., but then, this PR is not in a hurry...

@jreback
Copy link
Contributor

jreback commented Jun 28, 2019

@h-vetinari IMHO we don't need a replace at all.

@jreback
Copy link
Contributor

jreback commented Jun 28, 2019

In any event, if you can respond to the remaining comments (soon) we can merge this.

@h-vetinari
Copy link
Contributor Author

@jreback: @h-vetinari can you update for this point

This PR needs updating in several different ways (also review from @jorisvandenbossche), and as I mentioned I'm currently away for two weeks.

I'd like to clean this up when I come back (realising that it'll be after 0.25), and properly replace the removed functionality with something like join='outer'|'inner'|index-instance.

@jreback
Copy link
Contributor

jreback commented Jul 2, 2019

@jreback: @h-vetinari can you update for this point

This PR needs updating in several different ways (also review from @jorisvandenbossche), and as I mentioned I'm currently away for two weeks.

I'd like to clean this up when I come back (realising that it'll be after 0.25), and properly replace the removed functionality with something like join='outer'|'inner'|index-instance.

@h-vetinari that is a separate PR that may not be accepted
pls just stick to single purpose things

@TomAugspurger
Copy link
Contributor

What's the status here? Is this happening for 0.25.0?

@jreback
Copy link
Contributor

jreback commented Jul 2, 2019

prob needs a small amount of fixup; let me see if I can do this in the next couple of hours.

@TomAugspurger
Copy link
Contributor

TomAugspurger commented Jul 2, 2019 via email

@h-vetinari
Copy link
Contributor Author

@jreback @TomAugspurger

I wouldn't merge this before 0.25. It's been open for almost a year, so it's obviously not urgent. And I disagree that the direct replacement for the functionality we're deprecating should be a separate PR. If this gets merged before 0.25 there won't be replacement for that version (except the workaround of using .reindex manually).

OTOH, with a direct replacement in the same PR as the deprecation, one could formulate a clear path for updating.

@jreback jreback merged commit c0a4964 into pandas-dev:master Jul 3, 2019
@jreback
Copy link
Contributor

jreback commented Jul 3, 2019

thanks @h-vetinari

again, 1 PR 1 thing. The reason things take so long (not in this case), but generally is that they are too many things in 1 place.

@jorisvandenbossche
Copy link
Member

I have to say I agree with @h-vetinari that if we want something that replaces this functionality, we should not include this in 0.25.0 (unless we get the replacement in 0.25.0 as well). It is quite annoying to already start warning for something before a replacement is available (again, if we want to add such a replacement).

@jreback
Copy link
Contributor

jreback commented Jul 3, 2019

there already is a very good alternative, .reindex(), no need to re-invent the wheel here. This is less than useful in the non-Panel world. SO I am -1 on adding any replacement at all.

@jorisvandenbossche
Copy link
Member

It was from a discussion between the both of us that I commented #22318 (comment) with the idea of a join='first' or join='left', which is basically a replacement for the number 1 use case of join_axes.
I thought you were OK with that idea.

But still, it is something to discuss. There is also the idea of introducing a join_axis, but I personally think that the join one covers most use cases of join_axi/es.

@jreback
Copy link
Contributor

jreback commented Jul 3, 2019

It was from a discussion between the both of us that I commented #22318 (comment) with the idea of a join='first' or join='left', which is basically a replacement for the number 1 use case of join_axes.
I thought you were OK with that idea.

I mean maybe, but I'd like to see an actual usecase. Removing API is easy, adding new should have a high bar.

@h-vetinari
Copy link
Contributor Author

@jreback
I get the maintainer POV, but from a usability POV it's the other way around - removing API is costly, while additions are easy. And it's not even a new API - it'd just be changing the name/semantics of existing functionality.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Deprecate Functionality to remove in pandas Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Projects
None yet
Development

Successfully merging this pull request may close these issues.

DEPR: join_axes-kwarg for pd.concat
6 participants