
Generate numpy.ndarray from iterable with automatic memory pre-allocation #14479

Open
prhbrt opened this issue Sep 11, 2019 · 3 comments
Labels
33 - Question Question about NumPy usage or development

Comments

@prhbrt

prhbrt commented Sep 11, 2019

Colab

I think a general use case of numpy.ndarrays is loading data from several files into one array, these could be pickles, images, or anything that can easily be loaded as a numpy.ndarray. In this case all these files would have the same shape and dtype, and would have normal strides. Is there an elegant way to load such files, avoiding duplicate code and memory overhead?

E.g. files created like this:

import numpy
from skimage.io import imsave, imread
from glob import glob

sz = (128, 128, 1)

for i, color in enumerate([[0, 0, 1], [0, 1, 0], [0, 1, 1], [1, 0, 0], [1, 0, 1], [1, 1, 0]]):
  nice_image = numpy.clip(
      (0.95 + 0.1 * numpy.random.randn(*sz)) * [[color]],
      0, 1)
  imsave(
    f'nice-image-{i}.png',
    (255 * nice_image).astype(numpy.uint8)
  )

One way to open the files would be like this, but at its peak it needs twice the memory actually required, because the list of per-file arrays is built first and then copied into the new array.

images = numpy.concatenate([
  imread(filename)[numpy.newaxis] for filename in filenames
], axis=0)

Alternatively:

first_image = imread(filenames[0])

images = numpy.zeros((len(filenames), ) + first_image.shape, dtype=first_image.dtype)
images[0] = first_image

for image, filename in zip(images[1:], filenames[1:]):
  image[...] = imread(filename)

But this loads the data in two places, so duplicate code, which makes it less transparent and easily introduces bugs.

Would it make sense for the numpy API to have a fromiter_nd, for example:

images = numpy.fromiter_nd((
    imread(filename)[numpy.newaxis]
    for filename in filenames
), axis=0, length=len(filenames))

Existing functions fail in the following manner:

  • numpy.fromiter assumes scalars and creates a 1D array,
  • numpy.stack and numpy.concatenate do not inspect the first item or use the (possibly known) generator length to preallocate memory, and hence need twice the memory at peak.
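For illustration, a minimal sketch of what such a fromiter_nd could do internally: inspect the first item for shape and dtype, allocate the output once, and fill it in place. The name fromiter_nd and its signature here are hypothetical, taken from the proposal above, not an existing NumPy function.

```python
import numpy

def fromiter_nd(iterable, length):
    """Hypothetical sketch of the proposed fromiter_nd: learn shape and
    dtype from the first item, pre-allocate once, then fill in place."""
    it = iter(iterable)
    first = next(it)
    out = numpy.empty((length,) + first.shape, dtype=first.dtype)
    out[0] = first
    for i, item in enumerate(it, start=1):
        out[i] = item
    return out

# Stack four 2x2 arrays from a generator without an intermediate list.
stacked = fromiter_nd((numpy.full((2, 2), i) for i in range(4)), length=4)
# stacked.shape == (4, 2, 2)
```

This keeps the loading code in one place and never holds more than one item plus the output array in memory.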
@seberg
Member

seberg commented Sep 11, 2019

There is, or at least was, a very old PR which suggested making fromiter handle ND objects; I think such an addition should be acceptable.

If you know the exact shape ahead of time by inspecting the first item and the length of the list (which it sounds like you do), the typical thing is to allocate an empty array yourself and fill it up manually. If the concatenation is a bit more involved, np.split could possibly be elegant.

I suppose np.frombuffer with flattening and reshaping at the end would also work, but seems ugly to me (and could hide incorrect intermediate sizes).
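As a side note (not part of this thread): newer NumPy releases, 1.23 and later to my understanding, extended np.fromiter to accept subarray dtypes, which covers exactly this use case with a single pre-allocated pass. A sketch, assuming that version:

```python
import numpy

# With a subarray dtype, np.fromiter pre-allocates (count, *shape)
# and fills it directly from the generator (NumPy >= 1.23).
gen = (numpy.full((2, 3), i, dtype=numpy.uint8) for i in range(4))
images = numpy.fromiter(gen, dtype=numpy.dtype((numpy.uint8, (2, 3))), count=4)
# images.shape == (4, 2, 3)
```

Passing count lets NumPy allocate the output once up front instead of growing it.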

@prhbrt
Author

prhbrt commented Sep 13, 2019

@seberg Thanks. I see a lot of options, but I would say this deserves a transparent and clean solution. IMHO reshaping to flat and back is something I only do if really necessary, especially since the core idea of numpy is to have flat memory addressable with multiple indices via shapes and strides.

@rossbar rossbar added the 33 - Question Question about NumPy usage or development label Jul 23, 2020
@eric-wieser
Member

There is, or at least was, a very old PR which suggested making fromiter handle ND objects; I think such an addition should be acceptable.

#5340 is that PR.
