
Performance of fromfile on Python 3 #13319

Open

apontzen opened this issue Apr 13, 2019 · 8 comments

apontzen commented Apr 13, 2019

numpy.fromfile is drastically inefficient for small reads on python 3, running orders of magnitude slower than the same call on python 2. I believe this is because of changes made in response to #4118, which keep the file position in sync with python 3's IO buffering.

Naively implementing a pure python version of fromfile reveals that better performance is available, even in python 3.x. Is it possible to improve the performance of numpy.fromfile to match such a reference implementation?

Reproducing code example:

import numpy as np
import contextlib
import time
import sys
import os

@contextlib.contextmanager
def timeme(label):
    start = time.time()
    yield
    end = time.time()
    diff = (end-start)*1000
    print("%s: %.1f ms"%(label.rjust(45), diff))

def py_fromfile(f, dtype, num):
    # pure-python reference: allocate the array, then fill it in place
    # through Python's buffered I/O layer
    buf = np.empty(num, dtype)
    f.readinto(buf)
    return buf

def make_test_file():
    test_data = np.random.uniform(size=10**6)
    with open('testdata', 'wb') as f:
        test_data.tofile(f)

def remove_test_file():
    os.remove("testdata")

def time_read(num_chunks, reading_method):
    py_version = sys.version_info
    reading_name = reading_method.__name__
    test_label = 'Read in %d chunk(s), %s, py%d.%d'%(num_chunks, reading_name,
                                                 py_version[0], py_version[1])
    make_test_file()

    with timeme(test_label):
        with open('testdata', 'rb') as f:
            for i in range(num_chunks):
                reading_method(f, np.float32, 10**6//num_chunks)

    remove_test_file()

time_read(1, np.fromfile)
time_read(10000, np.fromfile)
time_read(10000, py_fromfile)

Running with any recent version of numpy on python 2.7 and 3.7 respectively gives results along the following lines:

          Read in 1 chunk(s), fromfile, py3.7: 2.0 ms
      Read in 10000 chunk(s), fromfile, py3.7: 128.0 ms
   Read in 10000 chunk(s), py_fromfile, py3.7: 16.3 ms

          Read in 1 chunk(s), fromfile, py2.7: 2.0 ms
      Read in 10000 chunk(s), fromfile, py2.7: 6.6 ms
   Read in 10000 chunk(s), py_fromfile, py2.7: 20.6 ms

Thus, on python 2.7, numpy.fromfile is as efficient as it can be (far more efficient than the pure python implementation), but on python 3.7 (and other 3.x) numpy.fromfile drastically underperforms relative to the pure python implementation. This suggests a better implementation for fromfile may be possible.

Numpy/Python version information:

This can be reproduced on all recent versions of numpy as far as I can tell. However, the stats given above are from the following two setups:

python 2:

1.16.2 2.7.16 |Anaconda, Inc.| (default, Mar 14 2019, 16:24:02) 
[GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)]

python 3:

1.16.2 3.7.2 (default, Dec 29 2018, 00:00:04) 
[Clang 4.0.1 (tags/RELEASE_401/final)]

mhvk commented Apr 14, 2019

@pv, @juliantaylor - any ideas? The performance of np.fromfile does seem quite slow for small pieces.


emanuele commented Apr 2, 2021

Any updates or workarounds for this issue? It would be really helpful.

@emanuele

Answering my own question: using f.readinto(buf) as in the example above, or np.frombuffer() as suggested here, largely works around the slowness of np.fromfile().
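
For concreteness, a minimal sketch of the np.frombuffer variant (the helper name is mine, purely illustrative):

import numpy as np

def frombuffer_read(f, dtype, count):
    # read through Python's buffered I/O layer, then wrap the bytes
    # without an extra copy; the result is read-only because it shares
    # memory with the immutable bytes object
    dtype = np.dtype(dtype)
    data = f.read(count * dtype.itemsize)
    return np.frombuffer(data, dtype=dtype, count=len(data) // dtype.itemsize)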

@apontzen (Author)

Yes, the reference implementation py_fromfile in the example code I posted above shows this workaround. However, it is not quite as fast as numpy used to be on python 2.

adamreeve added a commit to adamreeve/npTDMS that referenced this issue Sep 5, 2021
This is to avoid poor performance of np.fromfile,
see numpy/numpy#13319
adamreeve added a commit to adamreeve/npTDMS that referenced this issue Oct 25, 2021
This is to avoid poor performance of np.fromfile,
see numpy/numpy#13319
@Dominik1123

The problem seems to be that np.fromfile uses some low-level file reading interface and thus bypasses the higher-level buffering implemented in Python 3. Below is the code to check that behavior.

The following is a script which generates a 10 MiB file and then optionally reads it back in 512-byte chunks. I used strace to check the read system calls this produces:

# test.py

from argparse import ArgumentParser

import numpy as np


FILENAME = 'the_test_file'


def generate_file():
    data = np.arange(10 * 2**18).astype('i4')  # 10 MiB
    with open(FILENAME, 'wb') as f:
        f.write(data.tobytes())


def read_file():
    with open(FILENAME, 'rb') as f:
        for _ in range(10 * 2**11):  # 512 byte chunks
            x = np.fromfile(f, count=128, dtype='i4')
        assert x[-1] == 10 * 2**18 - 1


if __name__ == '__main__':
    parser = ArgumentParser()
    parser.add_argument('--read', action='store_true')
    args = parser.parse_args()
    if args.read:
        generate_file()
        read_file()
    else:
        generate_file()
        open(FILENAME, 'rb').close()  # this produces read calls on py3

First I run the script and inspect the resulting file:

$ python test.py 
$ python
Python 2.7.17 |Anaconda, Inc.| (default, Oct 21 2019, 19:04:46) 
[GCC 7.3.0] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import os
>>> os.stat('the_test_file').st_blksize
4096

Then I compare the read system calls with and without --read option:

$ strace --trace=read python test.py 2>&1 | wc -l
614
$ strace --trace=read python test.py --read 2>&1 | wc -l
3174
$ python
Python 2.7.17 |Anaconda, Inc.| (default, Oct 21 2019, 19:04:46) 
[GCC 7.3.0] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> n = 3174 - 614
>>> filesize = 10 * 2**20
>>> filesize / float(n)
4096.0

So this makes perfect sense: even though I'm requesting that np.fromfile read the file in 512-byte chunks, it effectively uses the buffer and issues only the necessary system calls at the indicated blksize.

Now with Python 3 the situation is quite different:

$ strace --trace=read python test.py 2>&1 | wc -l
390
$ strace --trace=read python test.py --read 2>&1 | wc -l
38790
$ python
Python 3.9.7 (default, Sep 16 2021, 13:09:58) 
[GCC 7.5.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> n = 38790 - 390
>>> filesize = 10 * 2**20
>>> filesize / n
273.06666666666666
>>> n, 10*2**11
(38400, 20480)

Here, the number of system calls is even greater than the number of requests via np.fromfile.

I further checked what is being requested from the file object by using a custom io.BytesIO subclass as a file-like proxy (accounting for #2230):

# test2.py

from __future__ import print_function
from collections import Counter, defaultdict
import io

import numpy as np


class BytesIO(io.BytesIO):
    def __init__(self, fileno):
        super(BytesIO, self).__init__()
        self._requests = defaultdict(Counter)
        self._fileno = fileno
        self._trace = False

    def __getattribute__(self, name):
        if object.__getattribute__(self, '_trace'):
            def _wrapper(*args):
                object.__getattribute__(self, '_requests')[name][args] += 1
                if name == 'fileno':
                    return object.__getattribute__(self, '_fileno')
                else:
                    return super(BytesIO, self).__getattribute__(name)(*args)
            return _wrapper
        else:
            return super(BytesIO, self).__getattribute__(name)


FILENAME = 'the_test_file'


def generate_file():
    data = np.arange(10 * 2**18).astype('i4')  # 10 MiB
    with open(FILENAME, 'wb') as f:
        f.write(data.tobytes())


def read_file():
    with open(FILENAME, 'rb') as f:
        b = BytesIO(f.fileno())
        b.write(f.read())
        b.seek(0)
        b._trace = True
        for _ in range(10 * 2**11):  # 512 byte chunks
            x = np.fromfile(b, count=128, dtype='i4')
        assert x[-1] == 10 * 2**18 - 1
        b._trace = False
        return b


if __name__ == '__main__':
    generate_file()
    b = read_file()
    print({k: sum(v.values()) for k,v in b._requests.items()})

Running this script reveals the following:

$ python test2.py 
{'flush': 20480, 'fileno': 40960, 'tell': 20480, 'seek': 20480}

There is not a single call to any of the read* methods, but there are twice as many fileno requests as calls to any other method. This suggests that np.fromfile performs low-level file access via the file descriptor (half of the fileno requests); the other half, together with the tell calls, are probably used to check how many bytes are left to read (something like os.fstat(b.fileno()).st_size - b.tell()). This already seems problematic, because on Python 3 all buffering takes place in the high-level io.BufferedReader implementation, and hence read calls should go through that interface. Looking at the seek arguments reveals that every np.fromfile call seems to be accompanied by a seek operation:

if __name__ == '__main__':
    generate_file()
    b = read_file()
    s = np.array([x[0] for x in b._requests['seek']])  # on py2 this requires `sorted`
    print(set(s[1:] - s[:-1]))

This prints a set with a single item, {512}, which is exactly the chunk size requested via np.fromfile.
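
Putting these observations together, here is a hypothetical reconstruction of the sequence each np.fromfile call appears to perform (inferred purely from the traced flush/fileno/tell/seek counts; the actual C implementation may differ):

import os

def fromfile_step(f, nbytes):
    # hypothetical per-call sequence, inferred from the trace above
    f.flush()                       # one flush per np.fromfile call
    fd = f.fileno()                 # fileno is requested (twice per call)
    pos = f.tell()                  # current logical position in the stream
    os.lseek(fd, pos, os.SEEK_SET)  # sync the raw descriptor, bypassing the buffer
    data = os.read(fd, nbytes)      # unbuffered read via the file descriptor
    f.seek(pos + len(data))         # the observed seek: advance by the chunk size
    return data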

To answer why there are more read system calls than calls to np.fromfile I looked at the output of strace:

strace --trace=read python test.py --read 2> log_strace

According to strace --trace=read python test.py 2>&1 | wc -l the first 390 lines belong to import and non-read_file related stuff.

$ tail -n +390 log_strace | head
read(4, "\0\0\0\0\1\0\0\0\2\0\0\0\3\0\0\0\4\0\0\0\5\0\0\0\6\0\0\0\7\0\0\0"..., 4096) = 4096
read(4, "\0\0\0\0\1\0\0\0\2\0\0\0\3\0\0\0\4\0\0\0\5\0\0\0\6\0\0\0\7\0\0\0"..., 512) = 512
read(4, "\200\0\0\0\201\0\0\0\202\0\0\0\203\0\0\0\204\0\0\0\205\0\0\0\206\0\0\0\207\0\0\0"..., 4096) = 4096
read(4, "\0\0\0\0\1\0\0\0\2\0\0\0\3\0\0\0\4\0\0\0\5\0\0\0\6\0\0\0\7\0\0\0"..., 1024) = 1024
read(4, "\0\1\0\0\1\1\0\0\2\1\0\0\3\1\0\0\4\1\0\0\5\1\0\0\6\1\0\0\7\1\0\0"..., 4096) = 4096
read(4, "\0\0\0\0\1\0\0\0\2\0\0\0\3\0\0\0\4\0\0\0\5\0\0\0\6\0\0\0\7\0\0\0"..., 1536) = 1536
read(4, "\200\1\0\0\201\1\0\0\202\1\0\0\203\1\0\0\204\1\0\0\205\1\0\0\206\1\0\0\207\1\0\0"..., 4096) = 4096
read(4, "\0\0\0\0\1\0\0\0\2\0\0\0\3\0\0\0\4\0\0\0\5\0\0\0\6\0\0\0\7\0\0\0"..., 2048) = 2048
read(4, "\0\2\0\0\1\2\0\0\2\2\0\0\3\2\0\0\4\2\0\0\5\2\0\0\6\2\0\0\7\2\0\0"..., 4096) = 4096
read(4, "\0\0\0\0\1\0\0\0\2\0\0\0\3\0\0\0\4\0\0\0\5\0\0\0\6\0\0\0\7\0\0\0"..., 2560) = 2560

Different numbers of bytes are being requested, always interleaved with 4096-byte reads (the file's blksize, which is used by the buffered file API). I also extracted the requested byte counts:

$ awk 'match($0, /[0-9]+[)]/) { print substr($0, RSTART, RLENGTH-1) }' log_strace > log_bytes
$ python -q
>>> from collections import Counter
>>> from pprint import pprint
>>> import numpy as np
>>> pprint(sorted(Counter(np.loadtxt('log_bytes', dtype=int)[390:]).items()))
[(0, 40),
 (200, 40),
 (512, 2560),
 (1024, 2560),
 (1536, 2560),
 (2048, 2560),
 (2560, 2560),
 (3072, 2560),
 (3584, 2560),
 (4096, 20399)]
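
As a sanity check on these counts (my own back-of-the-envelope arithmetic, not part of the strace output), the histogram accounts for essentially all of the traced reads and implies that roughly eleven times the file size is transferred from the kernel:

# counts copied from the histogram above
counts = {0: 40, 200: 40, 512: 2560, 1024: 2560, 1536: 2560, 2048: 2560,
          2560: 2560, 3072: 2560, 3584: 2560, 4096: 20399}
print(sum(counts.values()))                           # 38399, i.e. ~38400 = 38790 - 390
print(sum(s * n for s, n in counts.items()) / 2**20)  # ~114.7 MiB read for a 10 MiB file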

It's a little weird to see all these different numbers, but again it appears that there is some interference between buffered and unbuffered reading, with the different interfaces fighting each other over the file descriptor.

Perhaps someone who is familiar with the source code of np.fromfile could comment on what's going on behind the scenes there?


I used numpy==1.16.6 on Python 2.7 and 3.9, but I also cross-checked with numpy==1.21.2 on Python 3.9.


erwanp commented Aug 17, 2022


charris commented Aug 17, 2022

I would be interested in how NumPy 1.23.x performs. The code was implemented in C in that release.


seberg commented Aug 17, 2022

> I would be interested in how NumPy 1.23.x performs. The code was implemented in C in that release.

That was loadtxt, not fromfile. fromfile has always been in C, but it currently does raw, unbuffered reads; I am sure that could be changed in principle.
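
In the meantime, a minimal user-side sketch of routing reads through the buffered layer (fromfile_buffered is an illustrative name, not a NumPy API):

import numpy as np

def fromfile_buffered(f, dtype, count):
    # allocate first, then fill through io.BufferedReader, so that small
    # requests are served from Python's buffer instead of raw read() calls
    dtype = np.dtype(dtype)
    buf = np.empty(count, dtype)
    nread = f.readinto(buf)
    return buf[:nread // dtype.itemsize]  # truncate on a short read near EOF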
