# BUG: np.vectorize() functions operate twice on the first element (as seen when the function modifies a mutable object) #8758

Open
opened this Issue Mar 8, 2017 · 7 comments


### danjump commented Mar 8, 2017 • edited

 I have a list of dictionaries. I'm trying to use np.vectorize to apply a function that modifies dictionary elements for each dictionary in the list. The results seem to show that vectorize is acting twice on the first element. Is this a bug that can be fixed? (Perhaps it's related to the fact that vectorize checks the type of the first element?) Below are some example cases and output.

A simple test case with no dictionary modifications:

```python
import numpy as np

def fcn1(x):
    return x['b']

a = [{'b': 1} for _ in range(3)]
print(a)
print(np.vectorize(fcn1)(a))
print(a, '\n\n')
```

output:

```
[{'b': 1}, {'b': 1}, {'b': 1}]
[1 1 1]
[{'b': 1}, {'b': 1}, {'b': 1}]
```

Now modify the dictionary and see that the function is applied twice to the first element:

```python
def fcn2(x):
    x['b'] += 1
    return x['b']

a = [{'b': 1} for _ in range(3)]
print(a)
print(np.vectorize(fcn2)(a))
print(a, '\n\n')
```

output:

```
[{'b': 1}, {'b': 1}, {'b': 1}]
[3 2 2]
[{'b': 3}, {'b': 2}, {'b': 2}]
```

Try a different modification to check the consistency of the bug:

```python
def fcn3(x):
    x['b'] *= 2
    return x['b']

a = [{'b': 1} for _ in range(3)]
print(a)
print(np.vectorize(fcn3)(a))
print(a, '\n\n')
```

output:

```
[{'b': 1}, {'b': 1}, {'b': 1}]
[4 2 2]
[{'b': 4}, {'b': 2}, {'b': 2}]
```

You can do the same thing without actually providing a return value (which is how I'm trying to use it in my use case):

```python
def fcn4(x):
    x['b'] += 1

a = [{'b': 1} for _ in range(3)]
print(a)
np.vectorize(fcn4)(a)
print(a, '\n\n')
```

output:

```
[{'b': 1}, {'b': 1}, {'b': 1}]
[{'b': 3}, {'b': 2}, {'b': 2}]
```

And by the way, there is nothing special about a list of length 3; you can change that and see the same behavior of only the first element being double-modified. I've confirmed the behavior on both NumPy 1.11.3 and 1.12.0.

EDIT: I found a workaround that also confirms it's a "testing the type on the first element" issue.
If you specify the `otypes` argument, the first element doesn't get hit twice:

```python
def fcn(x):
    x['b'] += 1
    return x['b']

a = [{'b': 1} for _ in range(3)]
print(a)
print(np.vectorize(fcn, otypes=[dict])(a))
print(a, '\n\n')
```

output:

```
[{'b': 1}, {'b': 1}, {'b': 1}]
[2 2 2]
[{'b': 2}, {'b': 2}, {'b': 2}]
```
Author

### danjump commented Mar 8, 2017

 I just edited my post with a new test that seems to confirm that checking the first item's type IS what causes the problem.
Member

### eric-wieser commented Mar 8, 2017 • edited

 Simpler test case:

```python
import numpy as np

a = np.array([1, 2, 3])

def f(x):
    print('got', x)
    return x

fv = np.vectorize(f)
y = fv(a)
```

Gives:

```
got 1
got 1
got 2
got 3
```
Contributor

### warut-vijit commented Mar 8, 2017 • edited

 > "If otypes is not specified, then a call to the function with the first argument will be used to determine the number of outputs. The results of this call will be cached if cache is True to prevent calling the function twice. However, to implement the cache, the original function must be wrapped which will slow down subsequent calls, so only do this if your function is expensive."

https://docs.scipy.org/doc/numpy/reference/generated/numpy.vectorize.html

So it's well-documented behavior, but it does seem counterintuitive. Maybe it's more of an enhancement than a bug fix.
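A small sketch of the documented `cache=True` flag applied to the example from this thread: with caching enabled, the result of the type-probing call is reused for the first element, so each dict should be mutated exactly once (at the cost of the wrapping overhead the docs mention).

```python
import numpy as np

def fcn(x):
    x['b'] += 1
    return x['b']

a = [{'b': 1} for _ in range(3)]
# cache=True reuses the probing call's result for the first element,
# so fcn runs once per dict instead of twice on a[0]
out = np.vectorize(fcn, cache=True)(a)
print(out)  # [2 2 2]
print(a)    # every dict incremented exactly once
```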
Author

### danjump commented Mar 8, 2017

 Ah, clearly I didn't read the docs carefully enough. That does fix the double-call issue, but at the cost of slower execution. So I guess there is no way to prevent the double call while still maintaining performance?
Author

### danjump commented Mar 8, 2017 • edited

 For example, in `numpy/lib/function_base.py`, in the `vectorize` method `_get_ufunc_and_otypes()`, I would naively think you could modify these lines:

```python
inputs = [arg.flat[0] for arg in args]
outputs = func(*inputs)
```

to:

```python
# earlier
import copy
...
inputs = copy.deepcopy([arg.flat[0] for arg in args])
outputs = func(*inputs)
```

Then you would not have to use caching or specify `otypes`, but I think you would avoid hitting the actual mutable elements twice. I'm ignorant of how much of a performance hit that would give compared to caching, though. I'm just getting the sense that the caching behavior was designed with expensive function execution time in mind, not with the case of a function that modifies a mutable object. I would think it's possible to accommodate mutable-object modification without the performance hit of caching, which was designed around long function execution times.
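The same idea can be sketched at the user level without patching NumPy. The helper below (`probe_safe_vectorize` is a hypothetical name, not a NumPy API) does the type-probing call on a deep copy of the first element, then passes the discovered type to `vectorize` via `otypes` so no real element is ever hit twice:

```python
import copy
import numpy as np

def probe_safe_vectorize(func):
    # Hypothetical illustration of the suggested fix: run the
    # type-probing call on a deep copy so the real first element
    # is never mutated twice.
    def wrapped(seq):
        arr = np.asarray(seq, dtype=object)
        probe = func(copy.deepcopy(arr.flat[0]))  # mutates only the copy
        # With otypes given, vectorize skips its own probing call
        return np.vectorize(func, otypes=[type(probe)])(arr)
    return wrapped

def fcn(x):
    x['b'] += 1
    return x['b']

a = [{'b': 1} for _ in range(3)]
out = probe_safe_vectorize(fcn)(a)
print(out)  # [2 2 2]
print(a)    # each dict incremented exactly once
```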
Contributor

### warut-vijit commented Mar 8, 2017 • edited

 I think that for vectorized operations on arrays of very large objects, making a deep copy of the first element might actually cost more than applying the function itself. So it might be a good idea to have that behavior when the user doesn't specify `otypes`, but note in the documentation that performance may take a hit for arrays of large objects unless `otypes` is specified. @eric-wieser what do you think?
Member

### shoyer commented Mar 8, 2017

 `copy.deepcopy()` is not safe on many inputs, so unfortunately that's not a viable option. The only way to fix this would be to rewrite the core of `vectorize`, which currently uses `numpy.frompyfunc` to create an actual NumPy ufunc. An alternate inner loop would need to be created, similar to what we use for `np.apply_along_axis`. Unfortunately, to avoid performance degradation, I think we would need to do the loop in C (like `frompyfunc` does in the ufunc it constructs).
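Since `vectorize` is built on `numpy.frompyfunc`, one workaround sketch (assuming an object-dtype result is acceptable) is to call `frompyfunc` directly: it builds the ufunc without the type-probing call, so each element is visited exactly once.

```python
import numpy as np

def fcn(x):
    x['b'] += 1
    return x['b']

a = [{'b': 1} for _ in range(3)]
# frompyfunc(func, nin, nout) skips vectorize's type-probing call;
# the trade-off is that the result is an object-dtype array.
out = np.frompyfunc(fcn, 1, 1)(a)
print(out)  # [2 2 2]
print(a)    # each dict incremented exactly once
```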