Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ExtensionArray.shift incorrect for periods larger than length of array #23911

Closed
TomAugspurger opened this issue Nov 25, 2018 · 7 comments · Fixed by #23947
Closed

ExtensionArray.shift incorrect for periods larger than length of array #23911

TomAugspurger opened this issue Nov 25, 2018 · 7 comments · Fixed by #23947
Labels
ExtensionArray Extending pandas with custom dtypes or arrays. good first issue
Milestone

Comments

@TomAugspurger
Copy link
Contributor

TomAugspurger commented Nov 25, 2018

In [9]: from pandas.tests.extension.decimal import to_decimal

In [10]: to_decimal([1, 2]).shift(4)
Out[10]:
DecimalArray(array([Decimal('NaN'), Decimal('NaN'), Decimal('NaN'), Decimal('NaN')],
      dtype=object))

The expected output is

>>> to_decimal(['NaN', 'NaN'])
DecimalArray(array([Decimal('NaN'), Decimal('NaN')], dtype=object))

similar to categorical

In [12]: pd.Categorical([1, 2]).shift(4)
Out[12]:
[NaN, NaN]
Categories (2, int64): [1, 2]

This will need a new base test in tests/extension/base/methods.py and a fix in

def shift(self, periods=1):

@TomAugspurger TomAugspurger added this to the 0.24.0 milestone Nov 25, 2018
@TomAugspurger TomAugspurger added ExtensionArray Extending pandas with custom dtypes or arrays. good first issue Effort Low labels Nov 25, 2018
@Devilmoon
Copy link

Hi,

I would like to try and tackle this issue to bite my teeth into the pandas codebase, however I'm not entirely sure about the expected behaviour of the shift method.

From what I understand, when shift is called, the first period elements (up to the array's len) become NaN, and if period is smaller than the len of the array the original len-period elements become the last len-period elements, respecting the condition that period+len-period == len of the original array.

Is this correct?

@TomAugspurger
Copy link
Contributor Author

@Devilmoon thanks. I think your summary is correct.

The implementation is correct when abs(period) < len(self). The only bug is when periods is larger (in absolute terms).

The issue is on

empty = self._from_sequence([self.dtype.na_value] * abs(periods),

That should be

        empty = self._from_sequence(
            [self.dtype.na_value] * min(abs(periods), len(self)),
            dtype=self.dtype
        )

@varadgunjal
Copy link
Contributor

@TomAugspurger I'm trying to grok the correct way to put in a test for this.

I want to understand why a new base test is required under tests/extension/base/methods.py - wouldn't test_container_shift also cover testing the DecimalArray (generated by to_decimal) by casting it to a Series? Couldn't we just add the relevant test cases under @pytest.mark.parametrize and be done? Or am I thinking about this wrong?

@TomAugspurger
Copy link
Contributor Author

We write base tests for things that should be true of all extension array implementations http://pandas-docs.github.io/pandas-docs-travis/extending.html#testing-extension-arrays

I think something like

def test_shift_large_periods(self, data, na_value):
    data = data[:5]
    result = data.shift(10)
    expected = type(data)._construct_from_sequence([na_value] * 10)
    self.assert_extension_array_equal(result, expected)

should work, but I haven't tested it. Maybe also test for periods=-10.

@varadgunjal
Copy link
Contributor

Understood. Different question, why then should it be in that particular file which contains tests under the BaseMethodsTest class that are meant for Series and DataFrame methods?

@TomAugspurger
Copy link
Contributor Author

TomAugspurger commented Nov 26, 2018 via email

@varadgunjal
Copy link
Contributor

varadgunjal commented Nov 27, 2018

I followed a slightly different approach for the tests in the PR. Please let me know if that works.

TomAugspurger pushed a commit that referenced this issue Dec 9, 2018
* Fixing shift() for ExtensionArray #23911
Pingviinituutti pushed a commit to Pingviinituutti/pandas that referenced this issue Feb 28, 2019
Pingviinituutti pushed a commit to Pingviinituutti/pandas that referenced this issue Feb 28, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ExtensionArray Extending pandas with custom dtypes or arrays. good first issue
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants