
COMPAT: np 1.18 wants explicit dtype=object #30035

Closed (11 commits)
Conversation

jbrockmendel (Member)

xref #30030, xref numpy/numpy#15041, numpy/numpy#15045

ATM we have 85 tests failing in the npdev build; this gets it down to 14 (at least for me locally).

@jreback jreback added the Compat (pandas objects compatibility with NumPy or Python functions) label Dec 4, 2019
@jreback jreback added this to the 1.0 milestone Dec 4, 2019
```python
with warnings.catch_warnings():
    # See https://github.com/numpy/numpy/issues/15041
    warnings.filterwarnings("ignore", ".*with automatic object dtype.*")
    out = np.sum(list_of_columns, axis=0)
```
Contributor:

Can these be replaced with a `sep.join(...)` with a list comprehension?
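A hedged sketch of what that join-based replacement might look like; the sample data and the `sep` value here are assumptions, not from the diff:

```python
import numpy as np

# Hypothetical sample data: string columns concatenated elementwise.
list_of_columns = [np.array(["a", "b"], dtype=object),
                   np.array(["1", "2"], dtype=object)]

# Current approach: np.sum concatenates strings elementwise via "+".
out = np.sum(list_of_columns, axis=0)

# Join-based alternative with an (assumed) separator sep:
sep = ""
alt = np.array([sep.join(row) for row in zip(*list_of_columns)], dtype=object)
```

The comprehension never asks NumPy to infer a dtype from nested sequences, which is the point of the suggestion.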

Member Author:

Maybe? ATM I'm just trying to get a handle on the scope of the problem.

Contributor:

We will revert this for 1.18, probably also on master (but probably only temporarily, to pacify downstream test suites for a bit!). The interesting thing would be if there are cases where it cannot be fixed easily; i.e., adding a warning filter should not be necessary except for specific tests. (I am sure you are aware, but the filter context manager is not a good solution in this code, since it is not thread safe at all.)

```diff
@@ -1482,6 +1482,11 @@ def _format_strings(self) -> List[str]:
         if is_categorical_dtype(values.dtype):
             # Categorical is special for now, so that we can preserve tzinfo
             array = values._internal_get_values()
+        elif values.dtype.kind == "O":
```
Contributor:

Prefer to use `is_object_dtype`.
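A small sketch of the difference being suggested (the sample arrays are illustrative):

```python
import numpy as np
from pandas.api.types import is_object_dtype

# is_object_dtype handles anything with a dtype (including pandas
# ExtensionArrays), while .dtype.kind == "O" only works for objects
# exposing a NumPy-style dtype and reads less clearly.
values = np.array(["a", "b"], dtype=object)

assert is_object_dtype(values)      # preferred spelling
assert values.dtype.kind == "O"     # what the diff checks
assert not is_object_dtype(np.array([1, 2]))
```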

Member Author:

I didn't check whether that works on JSONArray, just that `values.dtype == object` doesn't.

jreback (Contributor) left a comment:

Small comments. Ideally, can you xfail the other instances that are not fixed here?

```diff
-arr = np.array(arr_list)
-tm.assert_numpy_array_equal(np.array(ujson.decode(ujson.encode(arr))), arr)
+arr = np.array(arr_list, dtype=object)
+result = np.array(ujson.decode(ujson.encode(arr)), dtype=object)
```
Contributor:

Can you create an `expected` here?
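A hedged illustration of what adding an explicit `expected` might look like, with stdlib `json` standing in for `ujson` and hypothetical sample data:

```python
import json
import numpy as np

# arr_list is an assumption, not the fixture used in the test.
arr_list = ["a", ["b", 1], 3.0]
arr = np.array(arr_list, dtype=object)

# Round-trip through JSON, then compare against an expected built
# independently, rather than comparing the round-trip to itself.
result = np.array(json.loads(json.dumps(arr.tolist())), dtype=object)
expected = np.array(arr_list, dtype=object)
```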

Comment on lines 151 to 154
```python
with warnings.catch_warnings():
    # See https://github.com/numpy/numpy/issues/15041
    warnings.filterwarnings("ignore", ".*with automatic object dtype.*")
    result = np.asarray(scalars, dtype=dtype)
```
Contributor:

This should be fixed at the caller, not here.

Member Author:

Yah, pretty much every usage where we're just suppressing the warnings was a kludge.

Contributor:

My thought for a way forward with the new sentinel in numpy/numpy#15119 is that you would leave this unchanged. When needed, you would add the sentinel as the dtype in the call to _from_sequence (or even pass it to whoever is calling _from_sequence). During the deprecation period, try to rework the use case that justifies the inefficient ragged array, and get back to NumPy with the use case for such a construct so we can stop or rethink the deprecation.

Comment on lines +1767 to 1770
```python
if self.dtype.kind == "O":
    # See https://github.com/numpy/numpy/issues/15041
    return np.asarray(self.values, dtype=object)
return np.asarray(self.values)
```
Contributor:

Since this is documented as being 1d only, this should be:

```python
x = np.empty(len(self.values), dtype=self.dtype)
x[:] = self.values
return x
```

which will work correctly even if `values == [[1, 2], [3, 4]]`.
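A runnable sketch of the pre-allocate-and-fill pattern suggested here, with illustrative data:

```python
import numpy as np

# Fixing the output as a 1-D object array up front keeps nested
# sequences as elements instead of letting np.asarray infer extra
# dimensions.
values = [[1, 2], [3, 4]]

x = np.empty(len(values), dtype=object)
for i, v in enumerate(values):
    # Per-element fill; a single slice assignment can itself trip over
    # shape inference when the nested rows happen to be equal length.
    x[i] = v
```

For comparison, `np.asarray(values, dtype=object)` infers a 2-D `(2, 2)` shape from the same input.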

eric-wieser (Contributor)

I think we (cc @mattip) might not have communicated how to fix this problem well enough in NEP 34. As I see it, there are two cases when dealing with coercing object arrays:

- You know exactly what shape or dimension you want the output to be, so pre-allocate with `np.empty`, then fill:

  ```python
  def as_1d(x):
      arr = np.empty(len(x), dtype=object)
      arr[:] = x
      return arr
  ```

  Now admittedly, it's not very easy to write such a function for anything other than 0d and 1d. Maybe we need to solve that before we undo the reversion.

- You don't know what shape the array is supposed to be. The caller is responsible for specifying this clearly, by performing the ambiguous conversion for you. Silencing the warning defeats the point of the deprecation:

  ```python
  def do_some_things(arr, obj):
      arr = np.asarray(arr)  # if this emits a warning, it is the caller's fault, not yours
  ```

  It's unfortunate that Python doesn't make it easy to append context to a warning and propagate it, so you can explain to pandas users that they need to fix the warning.
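The second case above can be sketched end to end; the function and data names here are illustrative, not from pandas:

```python
import numpy as np

# Library code: no dtype guessing, no warning filters. Any ambiguity
# must already be resolved by the time the array reaches us.
def do_some_things(arr):
    return np.asarray(arr)

# Caller code: the caller knows this ragged input should become a
# 1-D object array, so it performs that conversion explicitly.
ragged = [[1, 2], [3]]
out = do_some_things(np.array(ragged, dtype=object))
```

This keeps the deprecation warning pointed at the call site that actually owns the decision.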

jreback (Contributor) commented Dec 8, 2019

@jbrockmendel can you rebase?

mattip (Contributor) commented Dec 8, 2019

NumPy is considering adding something like np.array(vals, dtype=np.legacy_behaviour) so libraries (like pandas) could use that to keep the old behaviour. The idea would be that NumPy could add an additional API as mentioned in the Alternatives section in the NEP, or libraries could work out strategies that avoid using NumPy for this kind of container, and until then you could continue using the old behaviour.

jreback (Contributor) commented Dec 8, 2019

> NumPy is considering adding something like np.array(vals, dtype=np.legacy_behaviour) so libraries (like pandas) could use that to keep the old behaviour. The idea would be that NumPy could add an additional API as mentioned in the Alternatives section in the NEP, or libraries could work out strategies that avoid using NumPy for this kind of container, and until then you could continue using the old behaviour.

thanks @mattip

yeah we likely need a combination of strategies here; e.g. to fix where we can be sure of what the input is and suppress / remove where needed.

```diff
@@ -183,6 +185,15 @@ def concat_categorical(to_concat, axis: int = 0):
     return result


+def _safe_array(x):
```
Contributor:

So I actually like this as a real function; can you move it to pandas.compat.numpy and use it everywhere?

Contributor:

Can you do this one?

```diff
@@ -78,7 +78,7 @@ def data_for_sorting(allow_in_pandas, dtype):
     if dtype.numpy_dtype == "object":
         # Use an empty tuple for first element, then remove,
         # to disable np.array's shape inference.
-        return PandasArray(np.array([(), (2,), (3,), (1,)])[1:])
+        return PandasArray(np.array([(), (2,), (3,), (1,)], dtype=object)[1:])
```
Member Author:

@TomAugspurger does the comment above about shape inference have any bearing on the inference here?

Contributor:

I think using object dtype is still correct, and I think the comment is still correct.

If we want to get rid of the empty tuple / shape stuff, we could probably allocate the array with the right shape and set the values later:

```python
In [6]: a = np.empty((3,), dtype=object)

In [7]: a[:] = (2,), (3,), (1,)
```

jreback (Contributor) commented Dec 12, 2019

@jbrockmendel can you rebase again

jreback (Contributor) commented Dec 15, 2019

needs a rebase again

seberg (Contributor) commented Dec 16, 2019

Just to make sure you guys are up to date: it is very likely that the opt-in to the old behaviour will not be object, but rather a new sentinel object.

mattip (Contributor) commented Dec 16, 2019

> Just to make sure you guys are up to date: it is very likely that the opt-in to the old behaviour will not be object, but rather a new sentinel object.

There is a difference between `np.array([1, 2], dtype=object)` and `np.array([1, 2], dtype=np.sentinel_that_means_legacy_behaviour)`: the first will result in dtype=object, the second in dtype=int. See numpy/numpy#15083.

jbrockmendel (Member Author)

> Just to make sure you guys are up to date: it is very likely that the opt-in to the old behaviour will not be object, but rather a new sentinel object.

@seberg @mattip thanks. Is there a way to re-enable the warning so I can see where we need to update things?

mattip (Contributor) commented Dec 16, 2019

You could build numpy with numpy/numpy#15119. I just pushed it so you might want to wait to see if it passes all tests before trying it out.

jbrockmendel (Member Author)

Closing, will work on this locally.

@jbrockmendel jbrockmendel deleted the cifix5 branch December 18, 2019 21:26
seberg (Contributor) commented Dec 23, 2019

@jbrockmendel did you already make progress on this? I am curious if an np.allow_object flag to opt-in to the old behaviour solves the problem (also, assuming that we may want to remove it eventually). E.g. I could imagine that it may be convenient to be able to opt-in to the future error immediately.

In the best case, the cases that are not easily fixed are cases where np.allow_object can be pushed to the end-user as a (potentially) intermediate solution.

jbrockmendel (Member Author)

> did you already make progress on this? […]

We've updated most of the places that were easy to just add an explicit "dtype=object".

I haven't gotten around to working through this locally as suggested here. Do you agree that would be a good next step (and if so, is that still the right branch to use)?

seberg (Contributor) commented Dec 23, 2019

Good question; mostly good to know for us right now. If you (and we) are fine with dtype=object allowing ragged object arrays, and using object indiscriminately is actually OK for you, then we are fine (we may have to think about it again at NumPy). In most cases that can run into this (and where the warning/error is incorrect), I would hope you can even do np.empty(..., dtype=object) and set the shape manually. We need to figure out what downstream requires here to know what our realistic options are.

jbrockmendel (Member Author)

> and for you using object indiscriminately is actually OK

No, there are still many places where we currently call np.array without passing a dtype and depend on numpy's type inference. In many/most of those cases, we won't have pre-checked for non-raggedness.

There are a good chunk of those cases where we know we want 1D output, so as soon as we see a listlike element we know it'll end up object-dtype. Would passing ndim=1 in some cases where we can't specify dtype make this any easier?
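The 1-D case described above can be sketched directly; the helper name and data are illustrative, not pandas API:

```python
import numpy as np

def construct_1d_array(values):
    """Build a 1-D array. Once any element is list-like, the result
    must be object dtype, so construct it explicitly instead of
    relying on ragged-array inference."""
    if any(isinstance(v, (list, tuple, np.ndarray)) for v in values):
        out = np.empty(len(values), dtype=object)
        for i, v in enumerate(values):
            out[i] = v
        return out
    return np.asarray(values)  # scalars only: inference is unambiguous

mixed = construct_1d_array([1, [2, 3]])
```

This is roughly what an `ndim=1` hint would let NumPy do internally, but done on the pandas side.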

seberg (Contributor) commented Dec 23, 2019

@jbrockmendel adding an ndim keyword argument is one of the options, yes. We would prefer to add only the things that are really needed, though. We could also add the opposite (do not allow ragged), although that forbids using the allow_object name for other things in the future; in that case your pattern would be to use a try/except. The try/except idea seems only really useful, though, if: 1. you want to deprecate this also, and 2. not doing anything (forcing the user to fix it up) is not an option.

Do you have a start with the particularly tough spots? Maybe we have to look at it from numpy to get a better feel for the pain in pandas. I would be happy if pandas is just fine with np.allow_object/ragged, but even then it would be nice if at some far future we could move away from supporting it entirely.

jbrockmendel (Member Author)

> Do you have a start with the particularly tough spots?

I just cloned this branch and will test against it shortly.

jbrockmendel (Member Author)

I didn't get any of the expected errors on that branch. Is there another one I should try?

mattip (Contributor) commented Jan 13, 2020

xref numpy/numpy#15119
