
FIX: Fixed np.load time when file is large & compressed. #26509

Open

rajat315315 wants to merge 3 commits into main
Conversation

rajat315315

Fixes: #26498

Instead of passing a zipfile.ZipExtFile object to format.read_array(), I am now passing an _io.BufferedReader, whose read() is faster.

```python
# The fast path only applies to stored (uncompressed) members.
assert info.compress_type == 0
# Seek the archive's underlying file object to the start of the
# member's raw data, past its local file header.
self.zip.fp.seek(
    info.header_offset + len(info.FileHeader()) + 20
)
```
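
For illustration, here is a minimal sketch of that approach (it relies on zipfile internals such as `zf.fp` and on the ZIP local-file-header layout, and is not the PR's exact code):

```python
import struct
import zipfile

from numpy.lib import format


def read_member_fast(zf: zipfile.ZipFile, name: str):
    """Sketch: read a stored (uncompressed) .npy member through the
    archive's own buffered file object instead of a ZipExtFile."""
    info = zf.getinfo(name)  # documented, unlike zf.NameToInfo
    if info.compress_type != zipfile.ZIP_STORED:
        # Compressed members still need ZipExtFile to inflate the data.
        return format.read_array(zf.open(name))

    fp = zf.fp  # the underlying _io.BufferedReader (an implementation detail)
    fp.seek(info.header_offset)
    header = fp.read(30)  # the fixed-size part of the local file header
    fname_len, extra_len = struct.unpack("<HH", header[26:30])
    # The raw data starts right after the header, file name, and extra field.
    fp.seek(info.header_offset + 30 + fname_len + extra_len)
    return format.read_array(fp)
```

For a stored member the bytes on disk are the .npy stream itself, which is why the excerpt above can assert `compress_type == 0` and hand the raw file object straight to `read_array()`.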
Member

A small nitpick: NameToInfo seems undocumented, so I think using .getinfo() would be nicer (unfortunately, FileHeader() is undocumented as well).

One worry: without duplicating fp first, this isn't thread-safe. Or is NpzFile so fundamentally not thread-safe that we don't have to worry about it?
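
(For context, "duplicating fp first" might look like the sketch below, assuming the NpzFile was opened from a real path so that `zf.filename` is usable; `data_offset` stands for the member's data position, computed as in the sketch above.)

```python
import zipfile

from numpy.lib import format


def read_member_threadsafe(zf: zipfile.ZipFile, data_offset: int):
    # One private handle per read: concurrent readers each seek their
    # own file object instead of racing on the shared zf.fp position.
    with open(zf.filename, "rb") as fp:
        fp.seek(data_offset)
        return format.read_array(fp)
```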


I have been wondering if we could clean up read_array to use fp.readinto(res_array) to save unnecessary copies. Unfortunately, zipfile doesn't implement a specialized readinto, so that only saves one of the two copies (which may still be nice on its own).
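
(For illustration, the readinto() idea could look like the sketch below; `shape` and `dtype` stand in for whatever read_array parses out of the .npy header.)

```python
import numpy as np


def read_data_readinto(fp, shape, dtype):
    # Preallocate the result and let the file object fill its buffer
    # directly, avoiding the intermediate bytes object that fp.read(n)
    # followed by np.frombuffer(...) would create.
    res_array = np.empty(shape, dtype=dtype)  # C-contiguous, fixed-size dtype
    nread = fp.readinto(memoryview(res_array).cast("B"))
    assert nread == res_array.nbytes
    return res_array
```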

Author

Maybe the OP can answer about thread-safety, as he seems to have introduced the code.


Successfully merging this pull request may close these issues.

np.load("a.npz") very slow when a.npz file very large