
np.load("a.npz") very slow when a.npz file very large #26498

Open
AnnaTrainingG opened this issue May 22, 2024 · 9 comments · May be fixed by #26509

AnnaTrainingG commented May 22, 2024

Describe the issue:

np.load("a.npz") very slow when a.npz file very large

Reproduce the code example:

```python
import numpy as np
import time

x1 = np.arange(250000000, dtype=np.uint8)
x2 = np.arange(250000000, dtype=np.uint8)
x = {"x1": x1, "x2": x2}
np.savez('/tmp/test.npz', row=x)  # the dict is pickled into a 0-d object array

time0 = time.time()
xx = np.load("/tmp/test.npz", allow_pickle=True)['row']
print(time.time() - time0)
```
A faster approach might be:

```python
import zipfile
import numpy as np

def load_from_npz(path, name):
    zf = zipfile.ZipFile(path)
    info = zf.getinfo(name + '.npy')  # public equivalent of zf.NameToInfo[...]
    assert info.compress_type == zipfile.ZIP_STORED  # member must be uncompressed
    # Seek the raw archive file directly to the member's data instead of
    # reading through ZipExtFile (the +20 offset is empirical).
    zf.fp.seek(info.header_offset + len(info.FileHeader()) + 20)

    pickle_kwargs = dict(encoding='ASCII', fix_imports=True)
    return np.lib.format.read_array(zf.fp, allow_pickle=True,
                                    pickle_kwargs=pickle_kwargs)
```

The difference is the file object passed to read_array: if zf.open(name + '.npy') is used instead of seeking zf.fp directly, read_array becomes very slow.

Full test code:

```python
import time
import zipfile
import numpy as np

def load_from_npz(path, name):
    zf = zipfile.ZipFile(path)
    info = zf.getinfo(name + '.npy')  # public equivalent of zf.NameToInfo[...]
    assert info.compress_type == zipfile.ZIP_STORED  # member must be uncompressed
    # Seek the raw archive file directly to the member's data instead of
    # reading through ZipExtFile (the +20 offset is empirical).
    zf.fp.seek(info.header_offset + len(info.FileHeader()) + 20)

    pickle_kwargs = dict(encoding='ASCII', fix_imports=True)
    return np.lib.format.read_array(zf.fp, allow_pickle=True,
                                    pickle_kwargs=pickle_kwargs)

# create the .npz file
x1 = np.arange(250000000, dtype=np.uint8)
x2 = np.arange(250000000, dtype=np.uint8)
x = {"x1": x1, "x2": x2}
np.savez('/tmp/test.npz', row=x)

time0 = time.time()
xx = np.load("/tmp/test.npz", allow_pickle=True)['row']
print("np base:", time.time() - time0)

time0 = time.time()
xx = load_from_npz('/tmp/test.npz', 'row')
print("faster:", time.time() - time0)
```

Error message:

no

Python and NumPy Versions:

Python 3.9
NumPy 1.24.4

Runtime Environment:

No response

Context for the issue:

No response

rkern (Member) commented May 22, 2024

There is actually a FIXME to that effect in the code. This is the place where you could implement this more efficient strategy:

https://github.com/numpy/numpy/blob/main/numpy/lib/_npyio_impl.py#L235-L263
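A minimal sketch of what that strategy might look like at that spot, mirroring the workaround above: for members stored uncompressed, seek the archive's underlying file object to the member's data and hand it to read_array() directly, bypassing ZipExtFile. The function name `getitem_fast_path` is hypothetical, and the code leans on zipfile internals (zf.fp, ZipInfo.FileHeader) plus the empirical +20 offset from the report, so it is illustrative rather than a drop-in patch:

```python
import zipfile
import numpy as np

def getitem_fast_path(zf, key):
    info = zf.getinfo(key + '.npy')
    if info.compress_type != zipfile.ZIP_STORED:
        # Compressed member: fall back to the existing buffered path.
        return np.lib.format.read_array(zf.open(key + '.npy'),
                                        allow_pickle=True)
    # Stored member: position the raw archive file at the member's data
    # and let read_array() consume it directly. The +20 offset is the
    # empirical fudge from the workaround above, not a documented constant.
    zf.fp.seek(info.header_offset + len(info.FileHeader()) + 20)
    return np.lib.format.read_array(zf.fp, allow_pickle=True)
```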

rkern changed the title to "np.load("a.npz") very slow when unpickling a very large object array", then back to "np.load("a.npz") very slow when a.npz file very large" (May 22, 2024)
testhowtest commented:
#22916 proposes an enhancement to speed up numpy.load, which could potentially address our performance concerns. #2922 seems somewhat similar, involving the inability to load data larger than 2 GB on 64-bit systems. Perhaps the root causes of both issues are related; it might be worth exploring further.

seberg (Member) commented May 23, 2024

Robert already linked to what seems to be the interesting part. The issues above both seem unrelated, and any recent fixes/speedups would not affect the difference between the two paths shown.

AnnaTrainingG (Author) commented May 26, 2024

BTW, if it's the first time reading a.npz, this code takes about the same time as before, or is even slower. And when I try to use multiple threads to read the file, it takes more time. I have no idea why.

seberg (Member) commented May 26, 2024

> I have no idea why.

File caches. Hard disks are slow; the two extra copies just don't matter in that case.
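A quick way to see the cache effect (a minimal sketch; the first run is only cold if the file isn't already in the page cache):

```python
import time
import numpy as np

# Time the same load twice: the first run may pay the disk cost, while the
# second is typically served from the OS page cache, so only then do
# zipfile's extra in-memory copies dominate.
for run in ("first", "second"):
    t0 = time.time()
    np.load("/tmp/test.npz", allow_pickle=True)["row"]
    print(run, time.time() - t0)
```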

AnnaTrainingG (Author) commented:
Yes, most of the time is spent reading from disk.

seberg (Member) commented May 27, 2024

The more I look at this, the more I think that ZipFile.open() should be improved to either return a "normal" file or optimize the family of read functions (read and also readinto). If that were the case, the difference would likely just be gone without any code change in NumPy (for the pickle case; for the other case we may need to modernize the code to use fp.readinto()).
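For the non-pickle case, the readinto() idea could look roughly like this. A sketch only: `read_into_array`, `dtype`, and `count` are hypothetical stand-ins for what NumPy would parse from the .npy header.

```python
import numpy as np

def read_into_array(fp, dtype, count):
    # Allocate the destination first, then let the file object fill its
    # buffer directly, avoiding an intermediate bytes object per chunk.
    out = np.empty(count, dtype=dtype)
    n = fp.readinto(memoryview(out).cast("B"))
    assert n == out.nbytes  # sketch-level check; real code would loop
    return out
```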

That doesn't mean we cannot add a workaround, but the PR isn't thread-safe, and I think it needs to be (even if I am not sure all of zipfile is).
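The thread-safety concern is that the shared zf.fp has a single file position, so two threads seeking and reading concurrently would corrupt each other's reads. Any fast path along these lines would need to serialize access, e.g. (a sketch; `fp_lock` and `read_at` are illustrative names, not part of NumPy):

```python
import threading

fp_lock = threading.Lock()  # one lock per shared archive file object

def read_at(fp, offset, size):
    # seek + read must be atomic with respect to other readers of fp.
    with fp_lock:
        fp.seek(offset)
        return fp.read(size)
```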

AnnaTrainingG (Author) commented May 28, 2024

Another solution is to use savez_compressed() and then load the data. In that case reading the data is no longer the problem; decompression in pickle.load becomes the main cost.
Will you provide a faster decompression implementation?
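For reference, the compressed round-trip described above is just the following (a minimal sketch reusing `x` from the reproducer; the path '/tmp/test_c.npz' is arbitrary):

```python
import numpy as np

np.savez_compressed('/tmp/test_c.npz', row=x)  # members are deflate-compressed
xx = np.load('/tmp/test_c.npz', allow_pickle=True)['row']
```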

AnnaTrainingG (Author) commented:
#5976
