
Generate numpy.ndarray from iterable with automatic memory pre-allocation #14479

Open
prhbrt opened this issue Sep 11, 2019 · 3 comments
Labels
33 - Question Question about NumPy usage or development

Comments

@prhbrt

prhbrt commented Sep 11, 2019

Colab

I think a general use case of numpy.ndarrays is loading data from several files into one array, these could be pickles, images, or anything that can easily be loaded as a numpy.ndarray. In this case all these files would have the same shape and dtype, and would have normal strides. Is there an elegant way to load such files, avoiding duplicate code and memory overhead?

E.g. files created like this:

import numpy
from skimage.io import imsave, imread
from glob import glob

sz = (128, 128, 1)

for i, color in enumerate([[0, 0, 1], [0, 1, 0], [0, 1, 1], [1, 0, 0], [1, 0, 1], [1, 1, 0]]):
  nice_image = numpy.clip(
      (0.95 + 0.1 * numpy.random.randn(*sz)) * [[color]],
      0, 1)
  imsave(
    f'nice-image-{i}.png',
    (255 * nice_image).astype(numpy.uint8)
  )

One way to open the files would be like this, but at its peak it needs twice the memory actually required, because the list of per-file arrays is built first and then copied into the new array.

images = numpy.concatenate([
  imread(filename)[numpy.newaxis] for filename in filenames
], axis=0)

Alternatively:

first_image = imread(filenames[0])

images = numpy.zeros((len(filenames), ) + first_image.shape, dtype=first_image.dtype)
images[0] = first_image

for image, filename in zip(images[1:], filenames[1:]):
  image[...] = imread(filename)

But this loads the data in two places, so duplicate code, which makes it less transparent and easily introduces bugs.

Would it make sense for the numpy API to have a fromiter_nd, for example:

images = numpy.fromiter_nd((
    imread(filename)[numpy.newaxis]
    for filename in filenames
), axis=0, length=len(filenames))

Existing functions fail in the following manner:

  • numpy.fromiter assumes scalars and creates a 1D array,
  • numpy.stack and numpy.concatenate do not inspect the first item or use the (possibly known) generator length to preallocate memory, and hence need twice the memory at peak.
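For illustration, a minimal sketch of what such a fromiter_nd could do internally: inspect the first item for shape and dtype, allocate the output once, and fill it in place. The name fromiter_nd and its signature here are hypothetical, taken from the proposal above, not an existing NumPy function.

```python
import numpy

def fromiter_nd(iterable, length):
    """Hypothetical sketch of the proposed fromiter_nd: learn shape and
    dtype from the first item, pre-allocate once, then fill in place."""
    it = iter(iterable)
    first = next(it)
    out = numpy.empty((length,) + first.shape, dtype=first.dtype)
    out[0] = first
    for i, item in enumerate(it, start=1):
        out[i] = item
    return out

# Stack four 2x2 arrays from a generator without an intermediate list.
stacked = fromiter_nd((numpy.full((2, 2), i) for i in range(4)), length=4)
# stacked.shape == (4, 2, 2)
```

This keeps the loading code in one place and never holds more than one item plus the output array in memory.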
@seberg
Member

seberg commented Sep 11, 2019

There is, or at least was, a very old PR which suggested making fromiter handle ND objects; I think such an addition should be acceptable.

If you know the exact shape ahead of time by inspecting the first item and the length of the list (which it sounds like you do), the typical thing is to allocate an empty array yourself and fill it up manually. If the concatenation is a bit more involved, np.split could possibly be elegant.

I suppose np.frombuffer with flattening and reshaping at the end would also work, but seems ugly to me (and could hide incorrect intermediate sizes).
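As a side note (not part of this thread): newer NumPy releases, 1.23 and later to my understanding, extended np.fromiter to accept subarray dtypes, which covers exactly this use case with a single pre-allocated pass. A sketch, assuming that version:

```python
import numpy

# With a subarray dtype, np.fromiter pre-allocates (count, *shape)
# and fills it directly from the generator (NumPy >= 1.23).
gen = (numpy.full((2, 3), i, dtype=numpy.uint8) for i in range(4))
images = numpy.fromiter(gen, dtype=numpy.dtype((numpy.uint8, (2, 3))), count=4)
# images.shape == (4, 2, 3)
```

Passing count lets NumPy allocate the output once up front instead of growing it.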

@prhbrt
Author

prhbrt commented Sep 13, 2019

@seberg Thanks. I see a lot of options, but I would say this deserves a transparent and clean solution. IMHO reshaping to flat and back is something I only do if really necessary, especially since the core idea of numpy is to have flat memory addressable with multiple indices via shapes and strides.

@rossbar rossbar added the 33 - Question Question about NumPy usage or development label Jul 23, 2020
@eric-wieser
Member

There is, or at least was, a very old PR which suggested making fromiter handle ND objects; I think such an addition should be acceptable.

#5340 is that PR.
