
Performance of fromfile on Python 3 #13319

Open

apontzen opened this issue Apr 13, 2019 · 8 comments

apontzen commented Apr 13, 2019

numpy.fromfile is drastically inefficient for small reads on python 3, running orders of magnitude slower than the same call on python 2. I believe this is because of changes made in response to #4118, which keep the file position in sync with python 3's IO buffering.

Naively implementing a pure python version of fromfile reveals that better performance is available, even in python 3.x. Is it possible to improve the performance of numpy.fromfile to match such a reference implementation?

Reproducing code example:

import numpy as np
import contextlib
import time
import sys
import os

@contextlib.contextmanager
def timeme(label):
    start = time.time()
    yield
    end = time.time()
    diff = (end-start)*1000
    print("%s: %.1f ms"%(label.rjust(45), diff))

def py_fromfile(f, dtype, num):
    # pure-python reference: allocate the array, then fill it in place
    # through Python's buffered I/O layer
    buf = np.empty(num, dtype)
    f.readinto(buf)
    return buf

def make_test_file():
    test_data = np.random.uniform(size=10**6)
    with open('testdata', 'wb') as f:
        test_data.tofile(f)

def remove_test_file():
    os.remove("testdata")

def time_read(num_chunks, reading_method):
    py_version = sys.version_info
    reading_name = reading_method.__name__
    test_label = 'Read in %d chunk(s), %s, py%d.%d'%(num_chunks, reading_name,
                                                 py_version[0], py_version[1])
    make_test_file()

    with timeme(test_label):
        with open('testdata', 'rb') as f:
            for i in range(num_chunks):
                reading_method(f, np.float32, 10**6//num_chunks)

    remove_test_file()

time_read(1, np.fromfile)
time_read(10000, np.fromfile)
time_read(10000, py_fromfile)

Running with any recent version of numpy on python 2.7 and 3.7 respectively gives results along the following lines:

          Read in 1 chunk(s), fromfile, py3.7: 2.0 ms
      Read in 10000 chunk(s), fromfile, py3.7: 128.0 ms
   Read in 10000 chunk(s), py_fromfile, py3.7: 16.3 ms

          Read in 1 chunk(s), fromfile, py2.7: 2.0 ms
      Read in 10000 chunk(s), fromfile, py2.7: 6.6 ms
   Read in 10000 chunk(s), py_fromfile, py2.7: 20.6 ms

Thus, on python 2.7, numpy.fromfile is as efficient as it can be (far more efficient than the pure python implementation), but on python 3.7 (and other 3.x) numpy.fromfile drastically underperforms relative to the pure python implementation. This suggests a better implementation for fromfile may be possible.

Numpy/Python version information:

This can be reproduced on all recent versions of numpy as far as I can tell. However, the stats given above are from the following two setups:

python 2:

1.16.2 2.7.16 |Anaconda, Inc.| (default, Mar 14 2019, 16:24:02) 
[GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)]

python 3:

1.16.2 3.7.2 (default, Dec 29 2018, 00:00:04) 
[Clang 4.0.1 (tags/RELEASE_401/final)]

mhvk commented Apr 14, 2019

@pv, @juliantaylor - any ideas? The performance of np.fromfile does seem quite slow for small pieces.


emanuele commented Apr 2, 2021

Any updates or workarounds for this issue? It would be really helpful.

@emanuele

Answering my own question: using f.readinto(buf) as in the example above, or np.frombuffer() as suggested here, largely works around the slowness of np.fromfile().
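
For concreteness, a minimal sketch of the np.frombuffer variant (the helper name is mine, purely illustrative):

import numpy as np

def frombuffer_read(f, dtype, count):
    # read through Python's buffered I/O layer, then wrap the bytes
    # without an extra copy; the result is read-only because it shares
    # memory with the immutable bytes object
    dtype = np.dtype(dtype)
    data = f.read(count * dtype.itemsize)
    return np.frombuffer(data, dtype=dtype, count=len(data) // dtype.itemsize)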

@apontzen (Author)

Yes, the reference implementation py_fromfile in the example code I posted above shows this workaround. However, it is not quite as fast as numpy used to be on python 2.

adamreeve added a commit to adamreeve/npTDMS that referenced this issue Sep 5, 2021
This is to avoid poor performance of np.fromfile,
see numpy/numpy#13319
adamreeve added a commit to adamreeve/npTDMS that referenced this issue Oct 25, 2021
This is to avoid poor performance of np.fromfile,
see numpy/numpy#13319
@Dominik1123

The problem seems to be that np.fromfile uses some low-level file reading interface and thus bypasses the higher-level buffering implemented in Python 3. Below is the code to check that behavior.

The following is a script which generates a 10 MiB file and then optionally reads it back in 512-byte chunks. I used strace to check the read system calls this produces:

# test.py

from argparse import ArgumentParser

import numpy as np


FILENAME = 'the_test_file'


def generate_file():
    data = np.arange(10 * 2**18).astype('i4')  # 10 MiB
    with open(FILENAME, 'wb') as f:
        f.write(data.tobytes())


def read_file():
    with open(FILENAME, 'rb') as f:
        for _ in range(10 * 2**11):  # 512 byte chunks
            x = np.fromfile(f, count=128, dtype='i4')
        assert x[-1] == 10 * 2**18 - 1


if __name__ == '__main__':
    parser = ArgumentParser()
    parser.add_argument('--read', action='store_true')
    args = parser.parse_args()
    if args.read:
        generate_file()
        read_file()
    else:
        generate_file()
        open(FILENAME, 'rb').close()  # this produces read calls on py3

First I run the script and inspect the resulting file:

$ python test.py 
$ python
Python 2.7.17 |Anaconda, Inc.| (default, Oct 21 2019, 19:04:46) 
[GCC 7.3.0] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import os
>>> os.stat('the_test_file').st_blksize
4096

Then I compare the read system calls with and without --read option:

$ strace --trace=read python test.py 2>&1 | wc -l
614
$ strace --trace=read python test.py --read 2>&1 | wc -l
3174
$ python
Python 2.7.17 |Anaconda, Inc.| (default, Oct 21 2019, 19:04:46) 
[GCC 7.3.0] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> n = 3174 - 614
>>> filesize = 10 * 2**20
>>> filesize / float(n)
4096.0

So this makes perfect sense: even though I'm requesting that np.fromfile read the file in 512-byte chunks, it effectively uses the buffer and issues only the necessary system calls at the indicated blksize.

Now with Python 3 the situation is quite different:

$ strace --trace=read python test.py 2>&1 | wc -l
390
$ strace --trace=read python test.py --read 2>&1 | wc -l
38790
$ python
Python 3.9.7 (default, Sep 16 2021, 13:09:58) 
[GCC 7.5.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> n = 38790 - 390
>>> filesize = 10 * 2**20
>>> filesize / n
273.06666666666666
>>> n, 10*2**11
(38400, 20480)

Here, the number of system calls is even greater than the number of requests via np.fromfile.

I further checked what is being requested from the file object by using a custom io.BytesIO subclass as a file-like proxy (accounting for #2230):

# test2.py

from __future__ import print_function
from collections import Counter, defaultdict
import io

import numpy as np


class BytesIO(io.BytesIO):
    def __init__(self, fileno):
        super(BytesIO, self).__init__()
        self._requests = defaultdict(Counter)
        self._fileno = fileno
        self._trace = False

    def __getattribute__(self, name):
        if object.__getattribute__(self, '_trace'):
            def _wrapper(*args):
                object.__getattribute__(self, '_requests')[name][args] += 1
                if name == 'fileno':
                    return object.__getattribute__(self, '_fileno')
                else:
                    return super(BytesIO, self).__getattribute__(name)(*args)
            return _wrapper
        else:
            return super(BytesIO, self).__getattribute__(name)


FILENAME = 'the_test_file'


def generate_file():
    data = np.arange(10 * 2**18).astype('i4')  # 10 MiB
    with open(FILENAME, 'wb') as f:
        f.write(data.tobytes())


def read_file():
    with open(FILENAME, 'rb') as f:
        b = BytesIO(f.fileno())
        b.write(f.read())
        b.seek(0)
        b._trace = True
        for _ in range(10 * 2**11):  # 512 byte chunks
            x = np.fromfile(b, count=128, dtype='i4')
        assert x[-1] == 10 * 2**18 - 1
        b._trace = False
        return b


if __name__ == '__main__':
    generate_file()
    b = read_file()
    print({k: sum(v.values()) for k,v in b._requests.items()})

Running this script reveals the following:

$ python test2.py 
{'flush': 20480, 'fileno': 40960, 'tell': 20480, 'seek': 20480}

There is not a single call to any of the read* methods, but there are twice as many fileno requests as calls to any other method. This suggests that np.fromfile performs low-level file access via the file descriptor (half of the fileno requests); the other half, together with the tell calls, are probably used to check how many bytes are left to read (something like os.fstat(b.fileno()).st_size - b.tell()). This already seems problematic, because on Python 3 all buffering takes place in the high-level io.BufferedReader implementation, and hence read calls should go through that interface. Looking at the seek arguments reveals that every np.fromfile call seems to be accompanied by a seek operation:

if __name__ == '__main__':
    generate_file()
    b = read_file()
    s = np.array([x[0] for x in b._requests['seek']])  # on py2 this requires `sorted`
    print(set(s[1:] - s[:-1]))

This prints a set with a single item, {512}, which is exactly the chunk size requested via np.fromfile.
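
Putting these observations together, here is a hypothetical reconstruction of the sequence each np.fromfile call appears to perform (inferred purely from the traced flush/fileno/tell/seek counts; the actual C implementation may differ):

import os

def fromfile_step(f, nbytes):
    # hypothetical per-call sequence, inferred from the trace above
    f.flush()                       # one flush per np.fromfile call
    fd = f.fileno()                 # fileno is requested (twice per call)
    pos = f.tell()                  # current logical position in the stream
    os.lseek(fd, pos, os.SEEK_SET)  # sync the raw descriptor, bypassing the buffer
    data = os.read(fd, nbytes)      # unbuffered read via the file descriptor
    f.seek(pos + len(data))         # the observed seek: advance by the chunk size
    return data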

To answer why there are more read system calls than calls to np.fromfile I looked at the output of strace:

strace --trace=read python test.py --read 2> log_strace

According to strace --trace=read python test.py 2>&1 | wc -l the first 390 lines belong to import and non-read_file related stuff.

$ tail -n +390 log_strace | head
read(4, "\0\0\0\0\1\0\0\0\2\0\0\0\3\0\0\0\4\0\0\0\5\0\0\0\6\0\0\0\7\0\0\0"..., 4096) = 4096
read(4, "\0\0\0\0\1\0\0\0\2\0\0\0\3\0\0\0\4\0\0\0\5\0\0\0\6\0\0\0\7\0\0\0"..., 512) = 512
read(4, "\200\0\0\0\201\0\0\0\202\0\0\0\203\0\0\0\204\0\0\0\205\0\0\0\206\0\0\0\207\0\0\0"..., 4096) = 4096
read(4, "\0\0\0\0\1\0\0\0\2\0\0\0\3\0\0\0\4\0\0\0\5\0\0\0\6\0\0\0\7\0\0\0"..., 1024) = 1024
read(4, "\0\1\0\0\1\1\0\0\2\1\0\0\3\1\0\0\4\1\0\0\5\1\0\0\6\1\0\0\7\1\0\0"..., 4096) = 4096
read(4, "\0\0\0\0\1\0\0\0\2\0\0\0\3\0\0\0\4\0\0\0\5\0\0\0\6\0\0\0\7\0\0\0"..., 1536) = 1536
read(4, "\200\1\0\0\201\1\0\0\202\1\0\0\203\1\0\0\204\1\0\0\205\1\0\0\206\1\0\0\207\1\0\0"..., 4096) = 4096
read(4, "\0\0\0\0\1\0\0\0\2\0\0\0\3\0\0\0\4\0\0\0\5\0\0\0\6\0\0\0\7\0\0\0"..., 2048) = 2048
read(4, "\0\2\0\0\1\2\0\0\2\2\0\0\3\2\0\0\4\2\0\0\5\2\0\0\6\2\0\0\7\2\0\0"..., 4096) = 4096
read(4, "\0\0\0\0\1\0\0\0\2\0\0\0\3\0\0\0\4\0\0\0\5\0\0\0\6\0\0\0\7\0\0\0"..., 2560) = 2560

Different numbers of bytes are being requested, always interleaved with 4096-byte reads (the file's blksize, which is used by the buffered file API). I also extracted the requested byte counts:

$ awk 'match($0, /[0-9]+[)]/) { print substr($0, RSTART, RLENGTH-1) }' log_strace > log_bytes
$ python -q
>>> from collections import Counter
>>> from pprint import pprint
>>> import numpy as np
>>> pprint(sorted(Counter(np.loadtxt('log_bytes', dtype=int)[390:]).items()))
[(0, 40),
 (200, 40),
 (512, 2560),
 (1024, 2560),
 (1536, 2560),
 (2048, 2560),
 (2560, 2560),
 (3072, 2560),
 (3584, 2560),
 (4096, 20399)]
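
As a sanity check on these counts (my own back-of-the-envelope arithmetic, not part of the strace output), the histogram accounts for essentially all of the traced reads and implies that roughly eleven times the file size is transferred from the kernel:

# counts copied from the histogram above
counts = {0: 40, 200: 40, 512: 2560, 1024: 2560, 1536: 2560, 2048: 2560,
          2560: 2560, 3072: 2560, 3584: 2560, 4096: 20399}
print(sum(counts.values()))                           # 38399, i.e. ~38400 = 38790 - 390
print(sum(s * n for s, n in counts.items()) / 2**20)  # ~114.7 MiB read for a 10 MiB file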

It's a little weird to see all these different numbers, but again it appears that there is some interference between buffered and unbuffered reading, with the different interfaces fighting each other over the file descriptor.

Perhaps someone who is familiar with the source code of np.fromfile could comment on what's going on behind the scenes there?


I used numpy==1.16.6 on Python 2.7 and 3.9, but I also cross-checked with numpy==1.21.2 on Python 3.9.


erwanp commented Aug 17, 2022


charris commented Aug 17, 2022

I would be interested in how NumPy 1.23.x performs. The code was implemented in C in that release.


seberg commented Aug 17, 2022

> I would be interested in how NumPy 1.23.x performs. The code was implemented in C in that release.

That was loadtxt, not fromfile. fromfile has always been in C, but it currently does raw, unbuffered reads; I am sure that could be changed in principle.
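
In the meantime, a minimal user-side sketch of routing reads through the buffered layer (fromfile_buffered is an illustrative name, not a NumPy API):

import numpy as np

def fromfile_buffered(f, dtype, count):
    # allocate first, then fill through io.BufferedReader, so that small
    # requests are served from Python's buffer instead of raw read() calls
    dtype = np.dtype(dtype)
    buf = np.empty(count, dtype)
    nread = f.readinto(buf)
    return buf[:nread // dtype.itemsize]  # truncate on a short read near EOF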
