New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

API/BUG: .apply will correctly infer output shape when axis=1 #18577

Merged
merged 12 commits into from Feb 7, 2018

Conversation

Projects
None yet
5 participants
@jreback
Contributor

jreback commented Nov 30, 2017

closes #16353
closes #17348
closes #17437
closes #18573
closes #17970
closes #17892
closes #17602
closes #15628
closes #18775
closes #18901
closes #18919

This fixes apply to work correctly when the returned shape mismatches the original. It will try to set the indices if it possible. Setting to a list-like with axis=1 is now disallowed (but still possible if you operate row-wise). We were applying this inconsitently. This is of course a discouraged practice anyhow.

Prob should add some examples / update the doc-string a bit.

@jreback jreback added this to the 0.22.0 milestone Nov 30, 2017

@codecov

This comment has been minimized.

codecov bot commented Nov 30, 2017

Codecov Report

Merging #18577 into master will decrease coverage by 0.02%.
The diff coverage is 80%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #18577      +/-   ##
==========================================
- Coverage   91.45%   91.43%   -0.03%     
==========================================
  Files         157      157              
  Lines       51378    51392      +14     
==========================================
+ Hits        46987    46988       +1     
- Misses       4391     4404      +13
Flag Coverage Δ
#multiple 89.29% <80%> (-0.01%) ⬇️
#single 40.56% <36%> (-0.12%) ⬇️
Impacted Files Coverage Δ
pandas/core/frame.py 97.63% <80%> (-0.28%) ⬇️
pandas/io/gbq.py 25% <0%> (-58.34%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c40c8f8...dce186a. Read the comment docs.

@codecov

This comment has been minimized.

codecov bot commented Nov 30, 2017

Codecov Report

Merging #18577 into master will decrease coverage by <.01%.
The diff coverage is 94.21%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #18577      +/-   ##
==========================================
- Coverage    91.6%    91.6%   -0.01%     
==========================================
  Files         150      150              
  Lines       48750    48793      +43     
==========================================
+ Hits        44656    44695      +39     
- Misses       4094     4098       +4
Flag Coverage Δ
#multiple 89.97% <94.21%> (ø) ⬆️
#single 41.74% <47.93%> (-0.02%) ⬇️
Impacted Files Coverage Δ
pandas/io/formats/style.py 96.22% <100%> (ø) ⬆️
pandas/core/frame.py 97.43% <100%> (ø) ⬆️
pandas/core/sparse/frame.py 94.82% <100%> (-0.03%) ⬇️
pandas/core/apply.py 96.77% <93.75%> (-2.66%) ⬇️
pandas/util/testing.py 83.85% <0%> (+0.2%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 983d71f...1d93380. Read the comment docs.

@jreback

This comment has been minimized.

Contributor

jreback commented Dec 1, 2017

@jreback

This comment has been minimized.

Contributor

jreback commented Dec 7, 2017

@TomAugspurger @jorisvandenbossche if you have a chance

@chris-b1

This comment has been minimized.

Contributor

chris-b1 commented Dec 7, 2017

Unfortunately I'm guessing this will break someone's code in an subtle way, but current behavior is obviously bad, so seems like a net positive.

I've never used it so not sure I fully understand the itention, but this does seem to break the reduce keyword, docstring says "If reduce is True a Series will always be returned, and if False a DataFrame will always be returned."

In [23]: pd.__version__
Out[23]: '0.22.0.dev0+310.gf6f0371'

In [24]: df.apply(lambda x: (1, 2, 3), axis=1, reduce=False)
Out[24]: 
   A  B  C
0  1  2  3
1  1  2  3
2  1  2  3
3  1  2  3
4  1  2  3
5  1  2  3

In [25]: df.apply(lambda x: (1, 2, 3), axis=1, reduce=True)
Out[25]: 
   A  B  C
0  1  2  3
1  1  2  3
2  1  2  3
3  1  2  3
4  1  2  3
5  1  2  3


In [92]: pd.__version__
Out[92]: '0.21.0'

In [93]: df.apply(lambda x: (1, 2, 3), axis=1, reduce=False)
Out[93]: 
   A  B  C
0  1  2  3
1  1  2  3
2  1  2  3
3  1  2  3
4  1  2  3
5  1  2  3

In [94]: df.apply(lambda x: (1, 2, 3), axis=1, reduce=True)
Out[94]: 
0    (1, 2, 3)
1    (1, 2, 3)
2    (1, 2, 3)
3    (1, 2, 3)
4    (1, 2, 3)
5    (1, 2, 3)
dtype: object
@jorisvandenbossche

This comment has been minimized.

Member

jorisvandenbossche commented Dec 7, 2017

@jreback I will try to review this tomorrow

@jreback

This comment has been minimized.

Contributor

jreback commented Dec 7, 2017

i think i can simply deprecate reduce

let me see

@jreback

This comment has been minimized.

Contributor

jreback commented Dec 8, 2017

reduce= is deprecated, but I think we should actually figure out the use case and fix it. IIRC its if the passed apply function returns an empty, what do you return (a Series or a DataFrame). but need to sort thru some issues. maybe I will remove this and we can try later. (its the last commit and orthogonal).

.. ipython:: python
df = pd.DataFrame([[1,2], [1,2]], columns=['a','b'])

This comment has been minimized.

@jschendel

jschendel Dec 8, 2017

Member

Would be nice to display df here to show what it is prior to the apply, similar to what was done in the example above.

df.apply(lambda x: [1, 2, 3], axis=1)
df.apply(lambda x: [1, 2], axis=1)
The returned input will also *not* return a Series with the list-wrapper as previously.

This comment has been minimized.

@jschendel

jschendel Dec 8, 2017

Member

Could use code formatting here: ``Series``

dtype: object
New Behavior. The behaviour is consistent.

This comment has been minimized.

@jschendel

jschendel Dec 8, 2017

Member

Nitpick: The second "behaviour" is using British spelling, but the first isn't.

dtype: object
New Behaviour

This comment has been minimized.

@jschendel

jschendel Dec 8, 2017

Member

Nitpick: British spelling

@@ -4818,6 +4818,8 @@ def apply(self, func, axis=0, broadcast=False, raw=False, reduce=None,
while guessing, exceptions raised by func will be ignored). If
reduce is True a Series will always be returned, and if False a
DataFrame will always be returned.
.. deprecated:: 0.22.0

This comment has been minimized.

@jschendel

jschendel Dec 8, 2017

Member

Should this be mentioned in the whatsnew?

@jreback

This comment has been minimized.

Contributor

jreback commented Dec 10, 2017

I reverted the deprecation off reduce, I can deprecate but its complicated to actually replace it.

@jreback

This comment has been minimized.

Contributor

jreback commented Dec 10, 2017

thanks for comments @jschendel

@jorisvandenbossche

This comment has been minimized.

Member

jorisvandenbossche commented Dec 11, 2017

I didn't look at any code yet, but some general comments from experimenting with the branch:

  • I am big +1 on making those things more consistent, but I think we should provide an alternative to achieve the original (in the cases where it changes), i.e. have a way to specify you want no inference on the result, just put the result as a scalar in a single cell.
    We should also add documentation that explains the rules here to make this more predictable.

  • Column names of the result: using the example of the whatsnew, there is still an inconsistency in the resulting column names:

    In [1]: df = pd.DataFrame(np.random.randn(4, 3), columns=['A', 'B', 'C'])
    
    In [2]: df.apply(lambda x: [1, 2, 3], axis=1)
    Out[2]: 
       A  B  C
    0  1  2  3
    1  1  2  3
    2  1  2  3
    3  1  2  3
    
    In [3]: df.apply(lambda x: [1, 2], axis=1)
    Out[3]: 
       0  1
    0  1  2
    1  1  2
    2  1  2
    3  1  2
    

    So depending on the length of the list, it reuses the original column names or not. I would personally be in favor of fixing this inconsistency to not reuse the column names (although that would be an additional breaking change for the first case).
    Related, when returning a Series it no longer uses the explicit given names of the result, if the length of the result matches that of the original frame:

    In [10]: df.apply(lambda x: pd.Series([1, 2], index=['test', 'other']), 1)
    Out[10]: 
       test  other
    0     1      2
    1     1      2
    2     1      2
    3     1      2
    
    In [11]: df.apply(lambda x: pd.Series([1, 2, 3], index=['test', 'other', 'cols']), 1)
    Out[11]: 
       A  B  C
    0  1  2  3
    1  1  2  3
    2  1  2  3
    3  1  2  3
    

    So this last example ([11]) is a regression, as before it correctly returned the specified new names.

  • raw=True seems to change some of the inference, while, based on the explanation in the docstring, it should only change the values passed to the function and not the inference afterwards on the result of the function.

    In [24]: df.apply(lambda x: (1, 2), axis=1)
    Out[24]: 
       0  1
    0  1  2
    1  1  2
    2  1  2
    3  1  2
    
    In [25]: df.apply(lambda x: (1, 2), axis=1, raw=True)
    Out[25]: 
    0    (1, 2)
    1    (1, 2)
    2    (1, 2)
    3    (1, 2)
    dtype: object
    
  • Regarding the example of @chris-b1 above (#18577 (comment)), so for tuples this change is breaking the behaviour always (regardless of the length of the tuple). Before, tuples were always regarded as 'scalars', unless you specified reduce=False.
    I am not fully sure I like this change, and as said above I think we need an alternative if keeping to this change.

  • An illustration of where this "now always try to unpack iterable things like tuples into columns" would cause a problem in a real world use case:

    This branch:

    In [53]: from shapely.geometry import MultiLineString
    
    In [54]: def f(row):
        ...:     return MultiLineString([[(1,2), (3,4)], [(3,4), (4,2)]])
    
    In [55]: df.apply(f, axis=1)
    Out[55]: 
                           0                      1
    0  LINESTRING (1 2, 3 4)  LINESTRING (3 4, 4 2)
    1  LINESTRING (1 2, 3 4)  LINESTRING (3 4, 4 2)
    2  LINESTRING (1 2, 3 4)  LINESTRING (3 4, 4 2)
    3  LINESTRING (1 2, 3 4)  LINESTRING (3 4, 4 2)
    

    with 0.21.1 it preserves the returned objects in a single series (the desired result):

    In [94]: df.apply(f, axis=1)
    Out[94]: 
    0    (LINESTRING (1 2, 3 4), LINESTRING (3 4, 4 2))
    1    (LINESTRING (1 2, 3 4), LINESTRING (3 4, 4 2))
    2    (LINESTRING (1 2, 3 4), LINESTRING (3 4, 4 2))
    3    (LINESTRING (1 2, 3 4), LINESTRING (3 4, 4 2))
    4    (LINESTRING (1 2, 3 4), LINESTRING (3 4, 4 2))
    5    (LINESTRING (1 2, 3 4), LINESTRING (3 4, 4 2))
    dtype: object
    
  • I also noticed that the behaviour when returning arrays changed (to be consistent with lists, so probably the change is good), but just noticing it to make sure it is tested (previously it raised an error when the shape didn't match (eg lambda x: np.array([1, 2]) with the example above), now it behaves like a list).

@jorisvandenbossche

This comment has been minimized.

Member

jorisvandenbossche commented Dec 11, 2017

Contemplating my above comments, I think I would be in favor of (1) "no inference" for general iterable things (objects, tuples, ..), and (2) a clear way to enable / disable inference.

The switch to enable/disable inference could have a default depending on the result type, eg by default infer (so not putting the result as a scalar in a single cell) for Series objects and maybe arrays, and by default no inference for tuples, dicts, generic objects, and probably lists as well. When doing this, I think the main breakage is the case of a "list with length equal to number of columns of original frame", but in that case we could even catch this case and raise a FutureWarning.

Note I didn't look at all examples from the issues yet, so I might be missing some cases.

@jreback

This comment has been minimized.

Contributor

jreback commented Dec 11, 2017

so we need some sort of a parameter to allow inference on the returned result. This is analgous to specifying an output dtype & (shape) in np.vectorize maybe something like:

output=infer|scalar|reduce (and can remove the reduce= & raw= kwargs as well

@jorisvandenbossche

This comment has been minimized.

Member

jorisvandenbossche commented Dec 14, 2017

(and can remove the reduce= & raw= kwargs as well

I think raw is still orthogonal (it is to determine which type if passed to the function, not related to what is returned by the function).
But broadcast could maybe be integrated.

@jorisvandenbossche

This comment has been minimized.

Member

jorisvandenbossche commented Dec 14, 2017

What would a output='reduce' do? (compared to output='scalar' ?)

@jreback

This comment has been minimized.

Contributor

jreback commented Dec 14, 2017

yeah going to see what I can do with all of these parameters, calling it result_type for now.

@@ -508,7 +508,8 @@ def _apply(self, func, axis=0, subset=None, **kwargs):
subset = _non_reducing_slice(subset)
data = self.data.loc[subset]
if axis is not None:
result = data.apply(func, axis=axis, **kwargs)
result = data.apply(func, axis=axis,

This comment has been minimized.

@jreback

jreback Dec 28, 2017

Contributor

@TomAugspurger I had to do this work-around here to push a list-like result into a frame (which we are breaking generally in this patch, but here style is relying on this, so kind of ok with this behavior, I have to add some more tests); this is from pandas/tests/io/formats/test_style.test_apply_axis

. e.g.

In [5]:         df = pd.DataFrame({'A': [0, 0], 'B': [1, 1]})
   ...:         f = lambda x: ['val: {max}'.format(max=x.max()) for v in x]
   ...: 

In [6]: df
Out[6]: 
   A  B
0  0  1
1  0  1

In [7]: df.apply(f, axis=1)
Out[7]: 
0    [val: 1, val: 1]
1    [val: 1, val: 1]
dtype: object

In [8]: df.apply(f, axis=1, result_type='infer')
Out[8]: 
        A       B
0  val: 1  val: 1
1  val: 1  val: 1

I am not entirely sure why you need this this way.

This comment has been minimized.

@jorisvandenbossche

jorisvandenbossche Jan 28, 2018

Member

The above snippet is no longer correct (I mean: in the meantime we have chosen to not preserve the original columns names when a list is returned, so this will have column names [0, 1]).

This has no impact on the style code?

@jreback

This comment has been minimized.

Contributor

jreback commented Feb 6, 2018

your example is actually wrong, in that the result is good. it IS taking the index of the returned Series, which is [0, 1, 2]. I changed to do this

In [1]: df = pd.DataFrame(np.tile(np.arange(3), 6).reshape(6, -1) + 1, columns=['A', 'B', 'C'])
   ...: 

In [2]: df.apply(lambda x: [1, 2, 3], axis=1, result_type='broadcast')
Out[2]: 
   A  B  C
0  1  2  3
1  1  2  3
2  1  2  3
3  1  2  3
4  1  2  3
5  1  2  3

In [3]: df.apply(lambda x: Series([1, 2, 3], index=list('abc')), axis=1, result_type='broadcast')
Out[3]: 
   a  b  c
0  1  2  3
1  1  2  3
2  1  2  3
3  1  2  3
4  1  2  3
5  1  2  3

for broadcast, a list-like will get the original index, but if a Series is returned it will be the new index. I think this is consistent and useful to do.

@jorisvandenbossche

This comment has been minimized.

Member

jorisvandenbossche commented Feb 6, 2018

for broadcast, a list-like will get the original index, but if a Series is returned it will be the new index. I think this is consistent and useful to do.

If we decide to not broadcast a Series to the original layout, I think we should at least raise an error for that case, and not silently ignore the result_type keyword you passed.
But it is a breaking change compared to 0.22 (although, I suppose this is rather a corner case .. returning series and using broadcast=True)

@jreback

This comment has been minimized.

Contributor

jreback commented Feb 6, 2018

If we decide to not broadcast a Series to the original layout, I think we should at least raise an error for that case, and not silently ignore the result_type keyword you passed.
But it is a breaking change compared to 0.22 (although, I suppose this is rather a corner case .. returning series and using broadcast=True)

this already raises (an incorrect shape whether the result is a list of Series), added a test. (matches 0.22.0)

@jorisvandenbossche

This comment has been minimized.

Member

jorisvandenbossche commented Feb 7, 2018

Yes, if it has an incorrect shape, then it already raises (just like lists). But my example has a correct length, only different index names.

Using your example above:

In [1]: df = pd.DataFrame(np.tile(np.arange(3), 6).reshape(6, -1) + 1, columns=['A', 'B', 'C'])
   ...: 

In [2]: df.apply(lambda x: [1, 2, 3], axis=1, result_type='broadcast')
Out[2]: 
   A  B  C
0  1  2  3
1  1  2  3
2  1  2  3
3  1  2  3
4  1  2  3
5  1  2  3

In [3]: df.apply(lambda x: Series([1, 2, 3], index=list('abc')), axis=1, result_type='broadcast')
Out[3]: 
   a  b  c
0  1  2  3
1  1  2  3
2  1  2  3
3  1  2  3
4  1  2  3
5  1  2  3

I think the above is not the desired result. IMO one of the goals of result_type='broadcast should be to ensure that you can be 100% sure what the result will look like (i.e. dataframe with the original column names).
I don't really care about this case, so raising an error instead of preserving the original column names when the returned Series has different index names (the current behaviour on master for broadcast=True) is fine for me as well. I just think we should not silently ignore result_type='broadcast'. If you want the result as shown above, that is what the default of result_type=None already does, as well as the explicit result_type='expand'.

@jreback

This comment has been minimized.

Contributor

jreback commented Feb 7, 2018

you think that [3] should actually be [2], IOW we ignore if the resulting index (if any) on a broadcastable? ok I think that's fine.

@jreback

This comment has been minimized.

Contributor

jreback commented Feb 7, 2018

ok fixed up.

@jorisvandenbossche jorisvandenbossche merged commit 6b0c7e7 into pandas-dev:master Feb 7, 2018

3 checks passed

ci/circleci Your tests passed on CircleCI!
Details
continuous-integration/appveyor/pr AppVeyor build succeeded
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
@jorisvandenbossche

This comment has been minimized.

Member

jorisvandenbossche commented Feb 7, 2018

I think that is our record of closed issues in one go :-)

Thanks @jreback !

@jorisvandenbossche

This comment has been minimized.

Member

jorisvandenbossche commented Feb 7, 2018

I opened some follow-up issues: #19570, #19571
And a PR with some more doc clean-ups: #19573

jreback added a commit that referenced this pull request Feb 8, 2018

harisbal pushed a commit to harisbal/pandas that referenced this pull request Feb 28, 2018

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment