BUG: mean padding with np.pad may lead to different prepended and appended values #11216

Closed
adeak opened this issue May 31, 2018 · 4 comments

@adeak
Contributor

adeak commented May 31, 2018

The bug was posted on the mailing list. It boils down to the fact that np.pad with mean padding recomputes the mean from the half-padded array for the append step. Mathematically this would be fine, but with finite precision and a near-zero mean, the near-zero prepended value can bias the recomputed mean so that the prepended and appended values differ. Stripped-down example:

>>> import numpy as np
>>> a = np.array([-1, 2, -1]) + np.array([0, 1e-12, 0], dtype=np.float64)
>>> a = np.pad(a, (1, 1), 'mean')
>>> print(a)
[ 3.33362967e-13 -1.00000000e+00  2.00000000e+00 -1.00000000e+00
  3.33399974e-13]

As you can see, the first and the last values (both of which should be the mean) differ on the order of the machine epsilon. The prepended value is the true mean of the original array, the appended value is the mean where the prepended value is also taken into account. It was decided on the mailing list that this is a bug.
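
The mechanism can be reproduced without np.pad. The following is a minimal sketch in pure Python, assuming the mean is accumulated left to right starting from zero; `sequential_mean` is a hypothetical helper written for illustration, not NumPy's code:

```python
def sequential_mean(values):
    """Mean via plain left-to-right accumulation (hypothetical helper)."""
    total = 0.0
    for v in values:
        total += v
    return total / len(values)


data = [-1.0, 2.0 + 1e-12, -1.0]

# Mean of the original data: the value both pads should contain.
true_mean = sequential_mean(data)

# Emulate the two-step padding: prepend the mean, then recompute the mean
# from the half-padded data for the append step.
recomputed_mean = sequential_mean([true_mean] + data)

print(true_mean, recomputed_mean)
print(true_mean == recomputed_mean)  # False in IEEE 754 double precision
```

In double precision the two means differ by roughly 3.7e-17, the same gap as between the first and last entries of the padded array above. Adding the tiny prepended value to -1 rounds it onto a much coarser grid, and that rounding error survives into the recomputed mean.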

@hmaarrfk
Contributor

@eric-wieser can you chime in on whether or not this is a bug? When developing the "single copy" pad function, I was perplexed as to why the code was written in a way that made this possible. This would help me write a few tests for: #11358 (comment)

@adeak
Contributor Author

adeak commented Sep 16, 2018

@hmaarrfk if you read the last link in my bug report you'll get your answer.

@hmaarrfk
Contributor

of course...

hmaarrfk added a commit to hmaarrfk/numpy that referenced this issue Sep 16, 2018
This keeps the example shown in
numpy#11216
as a test.

Previously, the statistics would be computed on the newly
extended array. This would cause the leading and lagging
edge of the array to be padded with different values.
@hmaarrfk
Contributor

Force pushing really messes with github. @eric-wieser beat me to the punch by 10 hours:

xref #11961

eric-wieser pushed a commit to eric-wieser/numpy that referenced this issue Sep 17, 2018
The test is marked xfail right now as it is not fixed in master
seberg pushed a commit that referenced this issue Mar 25, 2019
* ENH: Add support for constant, edge, linear_ramp to new numpy.pad

Passes unit tests:
- TestConstant
- TestEdge
- TestZeroPadWidth
- TestLegacyVectorFunction
- TestNdarrayPadWidth
- TestUnicodeInput
- TestLinearRamp

* MAINT: Simplify diff / change order of functions

* MAINT: Revert to old handling of keyword-only arguments

* ENH: Add support for stat modes

* ENH: Add support for "reflect" mode

* MAINT: Remove _slice_column

* ENH: Add support for "symmetric" mode

* MAINT: Simplify mode "linear_ramp"

Creating the linear ramp as an array with 1-sized dimensions except
for the one given by `axis` allows implicit broadcasting to the needed
shape. This seems to be even a little bit faster than doing this by hand
and allows the simplification of the algorithm.

Note: Profiling and optimization will be done again at a later stage.

* MAINT: Reorder arguments of a sum and fix typo

Addresses feedback raised in PR.

* ENH: Add support for "wrap" mode

This completes the first draft of the complete rewrite, meaning all unit
tests should pass from this commit onwards.

* MAINT: Merge functions for "reflect" and "symmetric" mode

The set functions were nearly the same, apart from some index offsets.
Merging them reduces code duplication.

* TST: Add regression test for gh-11216

The rewrite in past commits fixed this bug.

* BUG: Fix edge case for _set_wrap_both when pad_amt contains 0.

And include test to protect against regression.

* MAINT: Simplify and optimize pad modes

Major changes & goals:

Don't deal with pad area in the front and back separately. This
modularity isn't needed and makes handling of the right edge more
awkward. All modes now deal with the left and right side at the same
time.

Move the creation of the linear ramps fully to its own function which
behaves like a vectorized version of linspace.

Separate calculation and application of the pad area where possible.
This means that _get_edges can be reused for _get_linear_ramps.

Combine _normalize_shape and _validate_lengths in a single function
which should handle common cases faster.

Add new mode "empty" which leaves the padded areas undefined.

Add documentation where it was missing.

* TST: Don't use np.empty in unit tests

* MAINT: Reorder workflow in numpy.pad and deal with empty dimensions

Only modes "constant" and "empty" can extend dimensions of size 0. Deal
with this edge case gracefully: all other modes either fail or
return an empty array with padded non-zero dimensions.

Handle default values closer to their actual usage. And validate
keyword arguments that must be numbers.

* MAINT: Add small tweaks to control flow and documentation

* BUG: Ensure wrap mode works if right_pad is 0

* ENH: Use reduced region of interest for iterative padding

When padding multiple dimensions iteratively, corner values are
unnecessarily overwritten multiple times. This function reduces the
working area for the first dimensions so that corners are excluded.

* MAINT: Restore original argument order in _slice_at_axis

* MAINT: Keep original error message of broadcast_to

* MAINT: Restore old behavior for non-number end_values.

* BENCH: Make the pad benchmark pagefault in setup

* ENH/TST: Preserve memory layout (order) of the input array

and add appropriate unit test.

* STY: Revert cosmetic changes to reduce diff

* MAINT: Pin dtype to float64 for np.pad's benchmarks

* MAINT: Remove redundant code path in _view_roi

* MAINT/TST: Provide proper error message for unsupported modes

and add appropriate unit test.

* STY: Keep docstrings consistent and fix typo.

* MAINT: Simplify logical workflow in pad

* MAINT: Remove dtype argument from _linear_ramp

The responsibility of rounding (but without type conversion) is not
really needed in _linear_ramp and only makes it a little bit harder to
reason about.

* DOC: Add version tag to new argument "empty"

* MAINT: Default to C-order for padded arrays

unless the input is F-contiguous.

* MAINT: Name slice of original area consistently

for all arguments describing the same thing.

* STY: Reduce vertical space

* MAINT: Remove shape argument from _slice_at_axis

Simplifies calls to this function and the function itself.
Using `(...,)` instead should keep this unambiguous. This change is not
compatible with Python 2.7 which doesn't support this syntax outside
sequence slicing. If that is wanted one could use `(Ellipsis,)` instead.

* TST: Test if end_values of linear_ramp are exact

which was not given in the old implementation `_arange_ndarray`.

* DOC: Improve comments and wrap long line

* MAINT: Refactor index_pair to width_pair

Calling the right value an index is just plain wrong as it can't be used
as such.

* MAINT: Make _linear_ramp compatible with size=0

* MAINT: Don't rely on negative indices for slicing

Calculating the proper positive index of the start of the right pad area
makes it possible to omit the extra code paths for a width of 0. This
should make the code easier to reason about.
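
The pitfall with negative indices can be shown on a plain list. This is an illustrative sketch; the two function names are hypothetical and do not appear in the patch:

```python
values = [10, 20, 30, 40]


def right_area_negative(seq, width):
    # Negative-index form: breaks for width == 0 because -0 == 0,
    # so seq[-0:] is the whole sequence rather than an empty tail.
    return seq[-width:]


def right_area_positive(seq, width):
    # Positive-index form: the start of the right area is computed
    # explicitly, so width == 0 naturally gives an empty result.
    return seq[len(seq) - width:]


print(right_area_negative(values, 0))  # [10, 20, 30, 40]
print(right_area_positive(values, 0))  # []
```

For any non-zero width both forms agree, which is why the negative-index variant needs a separate code path only for a width of 0.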

* MAINT: Skip calculation of right_stat if identical

If the input area for both sides is the same we don't need to calculate
it twice.
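
As a rough illustration of the idea, here is a pure-Python sketch that computes the statistic once from the unpadded input and reuses it on both sides. `pad_with_mean` is a hypothetical helper; it assumes the whole input is the statistic area for both sides and ignores options such as `stat_length`:

```python
def pad_with_mean(values, left, right):
    """Pad a 1-D list with the mean of the original values."""
    # Computed once from the unpadded input and shared by both sides.
    stat = sum(values) / len(values)
    return [stat] * left + list(values) + [stat] * right


data = [-1.0, 2.0 + 1e-12, -1.0]
padded = pad_with_mean(data, 1, 1)
print(padded[0] == padded[-1])  # True
```

Since the padded values never feed back into the statistic, the prepended and appended values are identical by construction.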

* TST: Adapt tests from gh-12789 to rewrite of pad

* TST: Add tests for mode "empty"

* TST: Test dtype persistence for all modes

* TST: Test exception for unsupported modes

* TST: Test repeated wrapping for each side

individually. Reaches some only partially covered if-statements in
_set_wrap_both.

* TST: Test padding of empty dimension with constant

* TST: Test if end_values of linear_ramp are exact

which was not given in the old implementation `_arange_ndarray`. (Was
accidentally overwritten during the last merge).

* TST: Test persistence of memory layout

Adapted from an older commit 3ac4d2a
which was accidentally overwritten during the last merge.

* MAINT: Simplify branching in _set_reflect_both

Reduce branching and try to make the calculation of the various indices
easier to understand.

* TST: Parametrize TestConditionalShortcuts class

* TST: Test empty dimension padding for all modes

* TST: Keep test parametrization ordered

Keep parametrization ordered, otherwise pytest-xdist might believe that
different tests were collected during parallelization causing test
failures.

* DOC: Describe performance improvement of np.pad

as well as the new mode "empty" in release notes (see gh-11358).

* DOC: Remove outdated / misleading notes

These notes are badly worded or actually misleading. For a better
explanation on how these functions work have a look at the context and
comments just above the lines calling these functions.
grlee77 pushed a commit to grlee77/numpy that referenced this issue Mar 26, 2019