Performance of fromfile on Python 3 #13319
@pv, @juliantaylor - any ideas? The performance of `np.fromfile` …
Any updates or workarounds for this issue? That would be really helpful.
Answering my own question: using `f.readinto(buf)` as in the example above, or `np.frombuffer()` as suggested here, significantly reduces the slowdown of `np.fromfile()` and mostly fixes the issue.
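For reference, a minimal sketch of that workaround (the file name and 512-byte chunk size just mirror the test scripts later in this thread; `chunks` and `data` are illustrative names, not from the original comment):

```python
import numpy as np

FILENAME = 'the_test_file'  # as generated by the test scripts in this thread

with open(FILENAME, 'rb') as f:
    buf = bytearray(512)          # reusable chunk buffer
    chunks = []
    while True:
        n = f.readinto(buf)       # served from Python's own IO buffer
        if not n:
            break
        # copy the filled slice so the reused buffer cannot alias the array
        chunks.append(np.frombuffer(bytes(buf[:n]), dtype='i4'))
    data = np.concatenate(chunks)
```

The point is simply that every actual read goes through the buffered `f`, and NumPy only ever sees in-memory bytes.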
Yes, the reference implementation …
This is to avoid poor performance of np.fromfile, see numpy/numpy#13319
The problem seems to be that …

The following is a script which generates a 10 MiB file and then optionally reads it in 512-byte chunks. I used …

```python
# test.py
from argparse import ArgumentParser

import numpy as np

FILENAME = 'the_test_file'


def generate_file():
    data = np.arange(10 * 2**18).astype('i4')  # 10 MiB
    with open(FILENAME, 'wb') as f:
        f.write(data.tobytes())


def read_file():
    with open(FILENAME, 'rb') as f:
        for _ in range(10 * 2**11):  # 512 byte chunks
            x = np.fromfile(f, count=128, dtype='i4')
        assert x[-1] == 10 * 2**18 - 1


if __name__ == '__main__':
    parser = ArgumentParser()
    parser.add_argument('--read', action='store_true')
    args = parser.parse_args()
    if args.read:
        generate_file()
        read_file()
    else:
        generate_file()
        open(FILENAME, 'rb').close()  # this produces read calls on py3
```

First I run the script and inspect the resulting file:
Then I compare the system calls issued while reading:
So this makes perfect sense: even though I'm requesting only 512 bytes at a time, the buffered layer issues far fewer, larger reads. Now with Python 3 the situation is quite different:
Here, the number of system calls is even greater than the number of requests via `np.fromfile`.

I further checked what is being requested from the file object by using a custom `io.BytesIO` subclass that records every call made on it:

```python
# test2.py
from __future__ import print_function
from collections import Counter, defaultdict
import io

import numpy as np


class BytesIO(io.BytesIO):
    def __init__(self, fileno):
        super(BytesIO, self).__init__()
        self._requests = defaultdict(Counter)
        self._fileno = fileno
        self._trace = False

    def __getattribute__(self, name):
        if object.__getattribute__(self, '_trace'):
            def _wrapper(*args):
                object.__getattribute__(self, '_requests')[name][args] += 1
                if name == 'fileno':
                    return object.__getattribute__(self, '_fileno')
                else:
                    return super(BytesIO, self).__getattribute__(name)(*args)
            return _wrapper
        else:
            return super(BytesIO, self).__getattribute__(name)


FILENAME = 'the_test_file'


def generate_file():
    data = np.arange(10 * 2**18).astype('i4')  # 10 MiB
    with open(FILENAME, 'wb') as f:
        f.write(data.tobytes())


def read_file():
    with open(FILENAME, 'rb') as f:
        b = BytesIO(f.fileno())
        b.write(f.read())
        b.seek(0)
        b._trace = True
        for _ in range(10 * 2**11):  # 512 byte chunks
            x = np.fromfile(b, count=128, dtype='i4')
        assert x[-1] == 10 * 2**18 - 1
        b._trace = False
        return b


if __name__ == '__main__':
    generate_file()
    b = read_file()
    print({k: sum(v.values()) for k, v in b._requests.items()})
```

Running this script reveals the following:
There is not a single call to any of the `read` methods. The `seek` requests are also revealing:

```python
if __name__ == '__main__':
    generate_file()
    b = read_file()
    s = np.array([x[0] for x in b._requests['seek']])  # on py2 this requires `sorted`
    print(set(s[1:] - s[:-1]))
```

This prints a set with a single item, i.e. every `seek` advances the position by the same constant stride. To answer why there are more system calls than `np.fromfile` requests, …
According to …
There are different numbers of bytes being requested, always interleaved with seeks.
It's a little weird to see all these different numbers, but again, it appears that there is some interference between buffered and unbuffered reading: different interfaces fighting each other over the file descriptor. Perhaps someone who is familiar with the source code of `np.fromfile` can shed some light on this. I used …
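For what it's worth, the tell/seek pattern above looks consistent with the synchronization introduced for #4118. The following is only a guess at the mechanism, paraphrased in Python; the real logic lives in NumPy's C layer, and `fromfile_paraphrase` is a name made up for this sketch:

```python
import os

import numpy as np


def fromfile_paraphrase(f, count, dtype):
    """Hypothetical Python paraphrase of what np.fromfile appears to do
    per call when handed a buffered file object (not the actual C code)."""
    dt = np.dtype(dtype)
    pos = f.tell()                          # where the buffered object thinks we are
    fd = f.fileno()
    os.lseek(fd, pos, os.SEEK_SET)          # sync the raw descriptor to that position
    raw = os.read(fd, count * dt.itemsize)  # unbuffered read: one syscall per call
    f.seek(os.lseek(fd, 0, os.SEEK_CUR))    # sync the buffered object back afterwards
    return np.frombuffer(raw, dtype=dt)
```

If this is roughly right, the final `seek()` on the buffered object is what hurts: it can invalidate the read-ahead buffer, so no small request is ever served from cache, which would explain both the extra `seek`/`tell` traffic and the syscall counts above.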
Btw, this SO thread seems related: https://stackoverflow.com/questions/71411907/dramatic-drop-in-numpy-fromfile-performance-when-switching-from-python-2-to-pyth
I would be interested in how NumPy 1.23.x performs. The code was implemented in C in that release.
That was `loadtxt`, not `fromfile`. `fromfile` has always been in C, but it currently does raw reads or so; I am sure that could be changed in principle.
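As a rough illustration of that raw-versus-buffered distinction, here is a small experiment of my own (the `CountingRaw` wrapper is invented for this sketch and assumes `the_test_file` from the scripts above exists):

```python
import io


class CountingRaw(io.FileIO):
    """FileIO subclass that counts how often the OS-level read is hit."""
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.reads = 0

    def readinto(self, b):
        self.reads += 1
        return super().readinto(b)


raw = CountingRaw('the_test_file', 'rb')
buffered = io.BufferedReader(raw)   # what open(..., 'rb') gives you on py3
for _ in range(10 * 2**11):         # 512-byte requests, as in the tests
    buffered.read(512)
print(raw.reads)  # far fewer than 20480: read-ahead batches the syscalls
```

Going through the buffered layer turns many small requests into a handful of large reads, which is what the raw reads in `fromfile` currently forgo.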
`numpy.fromfile` is drastically inefficient for small reads on Python 3: orders of magnitude slower than the same call on Python 2. I believe this is because of changes made in response to #4118, keeping things in sync despite the IO buffering in Python 3. Naively implementing a pure-Python version of `fromfile` reveals that better performance is available, even in Python 3.x. Is it possible to improve the performance of `numpy.fromfile` to match such a reference implementation?

Reproducing code example:
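A pure-Python reference implementation in the spirit described, a buffered `read()` plus a zero-copy `np.frombuffer()` view, might look like this sketch (the file and chunk sizes are borrowed from the scripts above; `fromfile_pure` and the timing harness are illustrative, not the original benchmark):

```python
import time

import numpy as np

FILENAME = 'the_test_file'          # 10 MiB of int32, as generated above
N_CHUNKS, COUNT = 10 * 2**11, 128   # 512-byte requests
DTYPE = np.dtype('i4')


def fromfile_pure(f, count, dtype):
    # Pure-Python reference: one buffered read, then a zero-copy view.
    return np.frombuffer(f.read(count * dtype.itemsize), dtype=dtype)


for impl in (np.fromfile, fromfile_pure):
    with open(FILENAME, 'rb') as f:
        t0 = time.time()
        for _ in range(N_CHUNKS):
            x = impl(f, count=COUNT, dtype=DTYPE)
        print(impl.__name__, time.time() - t0)
```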
Running with any recent version of NumPy on Python 2.7 and 3.7 respectively gives results along the following lines:

…

Thus, on Python 2.7, `numpy.fromfile` is as efficient as it can be (far more efficient than the pure-Python implementation), but on Python 3.7 (and other 3.x) `numpy.fromfile` drastically underperforms relative to the pure-Python implementation. This suggests a better implementation for `fromfile` may be possible.

Numpy/Python version information:
This can be reproduced on all recent versions of numpy as far as I can tell. However, the stats given above are from the following two setups:
Python 2: …
Python 3: …