
np.load("a.npz") very slow when a.npz file very large #26498

Open
AnnaTrainingG opened this issue May 22, 2024 · 9 comments · May be fixed by #26509

AnnaTrainingG commented May 22, 2024

Describe the issue:

np.load("a.npz") very slow when a.npz file very large

Reproduce the code example:

```python
import numpy as np
import time

x1 = np.arange(250000000, dtype=np.uint8)
x2 = np.arange(250000000, dtype=np.uint8)
x = {"x1": x1, "x2": x2}
np.savez('/tmp/test.npz', row=x)  # the dict is pickled into a 0-d object array

time0 = time.time()
xx = np.load("/tmp/test.npz", allow_pickle=True)['row']
print(time.time() - time0)
```
A faster approach might be:

```python
import zipfile
import numpy as np

def load_from_npz(path, name):
    zf = zipfile.ZipFile(path)
    info = zf.getinfo(name + '.npy')  # public equivalent of zf.NameToInfo[...]
    assert info.compress_type == zipfile.ZIP_STORED  # member must be uncompressed
    # Seek the raw archive file directly to the member's data instead of
    # reading through ZipExtFile (the +20 offset is empirical).
    zf.fp.seek(info.header_offset + len(info.FileHeader()) + 20)

    pickle_kwargs = dict(encoding='ASCII', fix_imports=True)
    return np.lib.format.read_array(zf.fp, allow_pickle=True,
                                    pickle_kwargs=pickle_kwargs)
```

The difference is the file object passed to read_array: if zf.open(name + '.npy') is used instead of seeking zf.fp directly, read_array becomes very slow.

Full test code:

```python
import time
import zipfile
import numpy as np

def load_from_npz(path, name):
    zf = zipfile.ZipFile(path)
    info = zf.getinfo(name + '.npy')  # public equivalent of zf.NameToInfo[...]
    assert info.compress_type == zipfile.ZIP_STORED  # member must be uncompressed
    # Seek the raw archive file directly to the member's data instead of
    # reading through ZipExtFile (the +20 offset is empirical).
    zf.fp.seek(info.header_offset + len(info.FileHeader()) + 20)

    pickle_kwargs = dict(encoding='ASCII', fix_imports=True)
    return np.lib.format.read_array(zf.fp, allow_pickle=True,
                                    pickle_kwargs=pickle_kwargs)

# create the .npz file
x1 = np.arange(250000000, dtype=np.uint8)
x2 = np.arange(250000000, dtype=np.uint8)
x = {"x1": x1, "x2": x2}
np.savez('/tmp/test.npz', row=x)

time0 = time.time()
xx = np.load("/tmp/test.npz", allow_pickle=True)['row']
print("np base:", time.time() - time0)

time0 = time.time()
xx = load_from_npz('/tmp/test.npz', 'row')
print("faster:", time.time() - time0)
```

Error message:

no

Python and NumPy Versions:

Python 3.9
NumPy 1.24.4

Runtime Environment:

No response

Context for the issue:

No response

rkern (Member) commented May 22, 2024

There is actually a FIXME to that effect in the code. This is the place where you could implement this more efficient strategy:

https://github.com/numpy/numpy/blob/main/numpy/lib/_npyio_impl.py#L235-L263
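A minimal sketch of what that strategy might look like at that spot, mirroring the workaround above: for members stored uncompressed, seek the archive's underlying file object to the member's data and hand it to read_array() directly, bypassing ZipExtFile. The function name `getitem_fast_path` is hypothetical, and the code leans on zipfile internals (zf.fp, ZipInfo.FileHeader) plus the empirical +20 offset from the report, so it is illustrative rather than a drop-in patch:

```python
import zipfile
import numpy as np

def getitem_fast_path(zf, key):
    info = zf.getinfo(key + '.npy')
    if info.compress_type != zipfile.ZIP_STORED:
        # Compressed member: fall back to the existing buffered path.
        return np.lib.format.read_array(zf.open(key + '.npy'),
                                        allow_pickle=True)
    # Stored member: position the raw archive file at the member's data
    # and let read_array() consume it directly. The +20 offset is the
    # empirical fudge from the workaround above, not a documented constant.
    zf.fp.seek(info.header_offset + len(info.FileHeader()) + 20)
    return np.lib.format.read_array(zf.fp, allow_pickle=True)
```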

rkern changed the title to "np.load("a.npz") very slow when unpickling a very large object array", then back to "np.load("a.npz") very slow when a.npz file very large" (May 22, 2024)
testhowtest commented:
#22916 proposes an enhancement to speed up numpy.load, which could potentially address our performance concerns. #2922 seems somewhat similar, involving the inability to load data larger than 2 GB on 64-bit systems. Perhaps the root causes of both issues are related; it might be worth exploring further.

seberg (Member) commented May 23, 2024

Robert already linked to what seems to be the interesting part. The issues above both seem unrelated, and any recent fixes/speedups would not affect the difference between the two paths shown.

AnnaTrainingG (Author) commented May 26, 2024

BTW, if it's the first time reading a.npz, this code takes about the same time as before, or is even slower. And when I try to use multiple threads to read the file, it takes more time. I have no idea why.

seberg (Member) commented May 26, 2024

> I have no idea why.

File caches. Hard disks are slow; the two extra copies just don't matter in that case.
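A quick way to see the cache effect (a minimal sketch; the first run is only cold if the file isn't already in the page cache):

```python
import time
import numpy as np

# Time the same load twice: the first run may pay the disk cost, while the
# second is typically served from the OS page cache, so only then do
# zipfile's extra in-memory copies dominate.
for run in ("first", "second"):
    t0 = time.time()
    np.load("/tmp/test.npz", allow_pickle=True)["row"]
    print(run, time.time() - t0)
```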

AnnaTrainingG (Author) commented:
Yes, most of the time is spent reading from disk.

seberg (Member) commented May 27, 2024

The more I look at this, the more I think that ZipFile.open() should be improved to either return a "normal" file or optimize the family of read functions (read and also readinto). If that were the case, the difference would likely just be gone without any code change in NumPy (for the pickle case; for the other case we may need to modernize the code to use fp.readinto()).
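For the non-pickle case, the readinto() idea could look roughly like this. A sketch only: `read_into_array`, `dtype`, and `count` are hypothetical stand-ins for what NumPy would parse from the .npy header.

```python
import numpy as np

def read_into_array(fp, dtype, count):
    # Allocate the destination first, then let the file object fill its
    # buffer directly, avoiding an intermediate bytes object per chunk.
    out = np.empty(count, dtype=dtype)
    n = fp.readinto(memoryview(out).cast("B"))
    assert n == out.nbytes  # sketch-level check; real code would loop
    return out
```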

That doesn't mean we cannot add a workaround, but the PR isn't thread-safe, and I think it needs to be (even if I am not sure all of zipfile is).
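The thread-safety concern is that the shared zf.fp has a single file position, so two threads seeking and reading concurrently would corrupt each other's reads. Any fast path along these lines would need to serialize access, e.g. (a sketch; `fp_lock` and `read_at` are illustrative names, not part of NumPy):

```python
import threading

fp_lock = threading.Lock()  # one lock per shared archive file object

def read_at(fp, offset, size):
    # seek + read must be atomic with respect to other readers of fp.
    with fp_lock:
        fp.seek(offset)
        return fp.read(size)
```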

AnnaTrainingG (Author) commented May 28, 2024

Another solution is to use savez_compressed() and then load the data. In that case reading the data is no longer the problem; decompression in pickle.load becomes the main cost.
Will you provide a faster decompression implementation?
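For reference, the compressed round-trip described above is just the following (a minimal sketch reusing `x` from the reproducer; the path '/tmp/test_c.npz' is arbitrary):

```python
import numpy as np

np.savez_compressed('/tmp/test_c.npz', row=x)  # members are deflate-compressed
xx = np.load('/tmp/test_c.npz', allow_pickle=True)['row']
```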

AnnaTrainingG (Author) commented:
#5976
