
Documentation for numpy.fromfunction induces an erroneous interpretation! #15726

Open · anoldmaninthesea opened this issue Mar 8, 2020 · 14 comments


@anoldmaninthesea

The documentation related to numpy.fromfunction states:

numpy.fromfunction(function, shape, **kwargs)
Construct an array by executing a function over each coordinate.
The resulting array therefore has a value fn(x, y, z) at coordinate (x, y, z).

However, when I run the following:
f=lambda m,n: (m,n)
np.fromfunction(f,(6,6),dtype=int)

I don't obtain an array from a list of tuples, but instead this:

(array([[0, 0, 0, 0, 0, 0],
[1, 1, 1, 1, 1, 1],
[2, 2, 2, 2, 2, 2],
[3, 3, 3, 3, 3, 3],
[4, 4, 4, 4, 4, 4],
[5, 5, 5, 5, 5, 5]]), array([[0, 1, 2, 3, 4, 5],
[0, 1, 2, 3, 4, 5],
[0, 1, 2, 3, 4, 5],
[0, 1, 2, 3, 4, 5],
[0, 1, 2, 3, 4, 5],
[0, 1, 2, 3, 4, 5]]))

@Qiyu8
Member

Qiyu8 commented Mar 9, 2020

There is a misunderstanding about fromfunction: you assume that f(x, y) is called once per coordinate, but in fact f(x, y) is invoked only once. Each input parameter is the full coordinate array for one dimension. In your case,

m is [[0, 0, 0, 0, 0, 0],
       [1, 1, 1, 1, 1, 1],
       [2, 2, 2, 2, 2, 2],
       [3, 3, 3, 3, 3, 3],
       [4, 4, 4, 4, 4, 4],
       [5, 5, 5, 5, 5, 5]],
and n is [[0, 1, 2, 3, 4, 5],
       [0, 1, 2, 3, 4, 5],
       [0, 1, 2, 3, 4, 5],
       [0, 1, 2, 3, 4, 5],
       [0, 1, 2, 3, 4, 5],
       [0, 1, 2, 3, 4, 5]],

so the result is correct.
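A quick way to confirm this, sketched below, is to record how often the function runs: it is invoked exactly once, with full (6, 6) index arrays as arguments.

```python
import numpy as np

calls = []

def f(m, n):
    # Record each invocation to show fromfunction calls f exactly once.
    calls.append((m.shape, n.shape))
    return (m, n)

m, n = np.fromfunction(f, (6, 6), dtype=int)

print(len(calls))        # 1 — f was called a single time
print(calls[0])          # ((6, 6), (6, 6)) — both arguments are full index arrays
print(m[3, 0], n[0, 3])  # 3 3 — m varies along axis 0, n along axis 1
```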

@rossbar
Contributor

rossbar commented Mar 10, 2020

I actually do think the docstring is a bit misleading:

Returns
-------
fromfunction : any
    The result of the call to `function` is passed back directly.
    Therefore the shape of `fromfunction` is completely determined by
    `function`.  If `function` returns a scalar value, the shape of
    `fromfunction` would not match the `shape` parameter.

The last sentence seems to be incorrect - if the function returns a scalar value, the shape of fromfunction will match the shape parameter. From the examples in the docstring:

>>> np.fromfunction(lambda i, j: i + j, (3, 3), dtype=int)
array([[0, 1, 2],
       [1, 2, 3],
       [2, 3, 4]])

@Qiyu8
Member

Qiyu8 commented Mar 10, 2020

@rossbar, maybe this is more precise? What do you think:

If `function` returns a non-scalar value, the shape of
    `fromfunction` may not match the `shape` parameter.

We could also add anoldmaninthesea's example to the docstring to demonstrate this situation.

@WarrenWeckesser
Member

FYI: I've seen this confusion before, in a question on stackoverflow: https://stackoverflow.com/questions/27612288/unexpected-result-numpy-fromfunction-with-constant-functions

@eric-wieser
Member

eric-wieser commented Mar 10, 2020

The last sentence seems to be incorrect - if the function returns a scalar value, the shape of fromfunction will match the shape parameter. From the examples in the docstring:

@rossbar, that example is one where the function does not return a scalar value (because i and j are themselves not scalars).
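To illustrate the distinction, here is a small sketch: a function that ignores its array arguments and returns a genuine scalar, so the result does not match the shape parameter at all, versus one whose result only looks scalar in source form.

```python
import numpy as np

# A function returning a true scalar: fromfunction passes the result
# back directly, so the output is not a (2, 2) array.
scalar_result = np.fromfunction(lambda i, j: 3.0, (2, 2))
print(scalar_result)        # 3.0

# i + j only looks scalar; i and j are arrays, so the result is too.
array_result = np.fromfunction(lambda i, j: i + j, (2, 2))
print(array_result.shape)   # (2, 2)
```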

@anoldmaninthesea
Author

I would be pleased with a change in the documentation. fromfunction can be a useful method as is; it's just the documentation that seems to lead the user into a mistake.

@rossbar
Contributor

rossbar commented Mar 10, 2020

@rossbar, that example is one where the function does not return a scalar value (because i and j are themselves not scalars).

I see - then this was definitely confusing to me. Whether the function lambda i, j : i + j returns scalars depends on the inputs. My problem was that I had a mental model of the function being applied element-wise at each coordinate rather than treating the coordinate inputs as arrays.

It is additionally confusing in the context of the original example lambda i, j : (i, j) which always returns a sequence. With that as a baseline, the concept of "returns a scalar" becomes even murkier.

Thanks for clearing this up @eric-wieser , though I must say without your expert intervention I was clearly getting the wrong idea from the docstring itself.

Also thanks @WarrenWeckesser for the SO link. I think incorporating part of your explanation into the docstring would go a long way toward clearing up (at least my) issues:

func is called just once, with array arguments.

@rossbar
Contributor

rossbar commented Mar 10, 2020

@Qiyu8 I think adding an example of a function that returns a tuple is a good idea, though the original example is not necessarily a good candidate, as it is equivalent to np.indices (which, upon closer inspection, is how the array inputs to np.fromfunction are constructed):

numpy/numpy/core/numeric.py

Lines 1766 to 1767 in 68224f4

args = indices(shape, dtype=dtype)
return function(*args, **kwargs)
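The equivalence is easy to check directly, as in this small sketch:

```python
import numpy as np

# The original example reproduces np.indices exactly.
from_fn = np.fromfunction(lambda i, j: (i, j), (6, 6), dtype=int)
idx = np.indices((6, 6))

print(np.array_equal(from_fn[0], idx[0]))  # True
print(np.array_equal(from_fn[1], idx[1]))  # True
```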

@jatin-code777

In addition to this, would it be a good idea to have a function that does what we expected numpy.fromfunction to do? It would call the provided function once per element (shape[0]*shape[1]*... times), passing coordinate tuples (0, 0), (0, 1), and so on, and it would always return an np.array of shape shape whose (i, j)-th entry is function(i, j).
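Such a helper is easy to sketch on top of np.ndindex (the name fromfunction_elementwise below is hypothetical, not an existing NumPy API):

```python
import numpy as np

def fromfunction_elementwise(func, shape, dtype=float):
    # Call func once per coordinate tuple, unlike np.fromfunction,
    # which calls it once with whole index arrays.
    out = np.empty(shape, dtype=dtype)
    for idx in np.ndindex(*shape):
        out[idx] = func(*idx)
    return out

# Each call receives plain integer coordinates:
a = fromfunction_elementwise(lambda i, j: 10 * i + j, (3, 3), dtype=int)
print(a)  # [[ 0  1  2] [10 11 12] [20 21 22]]
```

The trade-off is that this runs a Python-level call per element, so it is far slower than the vectorized single call np.fromfunction actually makes.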

@jon-middleton

I have another complaint. Consider the following example:

def foo(x,y): 
    return np.max([x+y, 0])
print(np.fromfunction(foo, (100, 100)))

Then NumPy throws a ValueError:

ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

So np.fromfunction doesn't seem to like conditionals in the function it takes as a parameter. Why is that? And is there a workaround?

@eric-wieser
Member

You should be using np.maximum(x+y, 0) there
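For completeness, a sketch of the corrected function: np.maximum is the element-wise ufunc and broadcasts the scalar 0, whereas np.max tries to reduce the ragged list [x + y, 0] to a single value, which is what triggers the error on array input.

```python
import numpy as np

def foo(x, y):
    # Element-wise maximum of the array x + y and the scalar 0.
    return np.maximum(x + y, 0)

result = np.fromfunction(foo, (3, 3))
print(result.shape)   # (3, 3)
print(result[2, 2])   # 4.0
```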

@gabri94

gabri94 commented Apr 21, 2023

Hi, I'm having similar issues which might be related to this misunderstanding of the use of np.fromfunction:

I want to accelerate the construction of a matrix like this:

los = np.zeros(shape=(len(nodes), len(np_points)), dtype=np.uint8)
for i in range(nodes.shape[0]):
    for j in range(np_points.shape[0]):
         los[i,j] = viewsheds[i, np_points[j,0], np_points[j,1]]

Using np.fromfunction as follows:

los = np.fromfunction(lambda i,j: viewsheds[i,  np_points[j, 0], np_points[j,1]],
                       shape=(len(nodes), len(t_points)),
                       dtype=np.uint8)

np_points is another numpy array of size (n,2)

The matrix has the right size, but its content is seemingly random and does not equal the los matrix computed manually. Am I misunderstanding the usage of the function, or is there an actual bug?
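Without the actual data this is only a guess, but note that dtype in np.fromfunction sets the dtype of the index arrays themselves, so dtype=np.uint8 makes indices wrap around past 255. With a wide integer dtype the construction does vectorize correctly, as in this self-contained sketch (viewsheds, np_points, and nodes below are made-up stand-ins):

```python
import numpy as np

rng = np.random.default_rng(0)
viewsheds = rng.integers(0, 2, size=(4, 5, 5), dtype=np.uint8)  # stand-in data
np_points = rng.integers(0, 5, size=(3, 2))
nodes = np.arange(4)

# Loop version, as in the comment above:
los = np.zeros(shape=(len(nodes), len(np_points)), dtype=np.uint8)
for i in range(len(nodes)):
    for j in range(len(np_points)):
        los[i, j] = viewsheds[i, np_points[j, 0], np_points[j, 1]]

# Vectorized version; dtype=int keeps the index arrays from wrapping.
los2 = np.fromfunction(
    lambda i, j: viewsheds[i, np_points[j, 0], np_points[j, 1]],
    shape=(len(nodes), len(np_points)), dtype=int)

print(np.array_equal(los, los2))  # True
```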

@matanox

matanox commented Jan 26, 2024

It seems that both broadcasting and vectorization take place when the supplied function is called, in the example code at the top of this issue just as in the following example:

def distance(a ,b):
    print('I have been called')
    return abs(a - b)

def fill(stream: np.ndarray, query: np.ndarray):
    return np.fromfunction(
        lambda i, j: distance(stream[i], query[j]),
        (len(stream), len(query)), dtype=int)

fill(np.array(range(3)), np.array(range(10)))

The best way to understand it is to look at its source code: np.fromfunction simply feeds the given function the array of indices implied by the provided shape, so perhaps the documentation could just say that more directly. This is what enables broadcasting and vectorization wherever the passed function (in this case distance, passed along via the lambda) uses ufuncs on its inputs.

So for example in the above code, the lambda is called once, receiving the array of indices implied by the shape as a single numpy array, which is what indices() yields inside fromfunction if you read its code. Because the first axis of this 3D array of indices has length 2, it is seamlessly destructured into the lambda's two variables, i and j, which in turn lets distance receive the data matrices a and b built from these indices; one call to distance is then enough.
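The destructuring step can be seen in isolation:

```python
import numpy as np

idx = np.indices((3, 10))   # one array stacking both index grids
print(idx.shape)            # (2, 3, 10)

i, j = idx                  # iterating over axis 0 unpacks the two grids
print(i.shape, j.shape)     # (3, 10) (3, 10)
```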

Observing that a and b are numpy objects, the python interpreter simply delegates the - and abs operations to numpy's corresponding ufuncs which operate on arrays. Each one of these two array operations can now be said to be "vectorized" in the sense that it applies a fixed operation onto arrays of data using efficient C code which employs tight loops of computation over the memory locations, which is precisely what numpy does for you when you apply something like - between two arrays.

To sum it up, a function performing arithmetic (distance) got repurposed to operate on arrays by applying np.fromfunction with a lambda expression.

And despite this plan of execution allocating orders of magnitude more main memory for the matrices a and b than the two original 1D input arrays take up, the computation is far faster than looping in plain Python to fill in the cell values by hand, especially as the length of the input arrays grows.

@matanox

matanox commented Jan 26, 2024

As this actually relates back to the documentation: when benchmarking, fill() is more than an order of magnitude faster than using np.vectorize(distance) (with the first argument reshaped into a column vector, as the np.vectorize version requires). It looks like np.fromfunction really does make vectorization happen, whereas np.vectorize does not, which is nearly the opposite of what the naming of these functions suggests.
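The difference in call granularity is easy to measure with a sketch (the distance function here is the same toy example as above):

```python
import numpy as np

calls = {"n": 0}

def distance(a, b):
    calls["n"] += 1
    return abs(a - b)

stream, query = np.arange(3), np.arange(10)

# fromfunction: distance runs once, on whole index arrays.
calls["n"] = 0
np.fromfunction(lambda i, j: distance(stream[i], query[j]), (3, 10), dtype=int)
print(calls["n"])   # 1

# np.vectorize: one Python-level call per output element.
calls["n"] = 0
np.vectorize(distance, otypes=[int])(stream[:, None], query[None, :])
print(calls["n"])   # 30
```

Passing otypes explicitly avoids the extra probing call np.vectorize otherwise makes to infer the output type.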

Is it really the case, as its docs page says, that:

The vectorize function is provided primarily for convenience, not for performance. The implementation is essentially a for loop

It seems that the use of the term vectorization in np.vectorize's documentation goes against numpy's documentation glossary of the term vectorization.

I think either the documentation should be harmonized on this, or perhaps more aggressive compilation should happen inside vectorize if it were actually to compile the function it is given rather than only wrap Python code around it.

As is, the path to learning how to leverage vectorization performance around one's own Python functions performing arithmetic feels almost treacherous as far as documentation and function names are concerned. Harmonizing the documentation's use of the term vectorization, whether defined in a NumPy-specific way or generically, would best serve the cause of clarity about numpy.

It seems that Numba has developed a pair of further vectorizing variants of the vectorize concept, but I suspect it will not feel stable until Numba is tightly integrated with NumPy's development and release cycles; I have run into numerous stability concerns (bad luck?).
