
Performance: speed of getting blob.data for large files (as compared to GitPython) #752

Closed
jnareb opened this issue Dec 2, 2017 · 3 comments


@jnareb

jnareb commented Dec 2, 2017

I compared the speed of the equivalent of git show <revision>:<pathname> in pygit2 and in GitPython (the pure-Python implementation). In every other case I have tested, pygit2 is faster, but for very large files the git show / git cat-file equivalent is slower.

pygit2 code:

blob = repo.revparse_single(commit + ':' + path)
result = blob.data

GitPython code:

blob = repo.rev_parse(commit + ':' + path)
result = blob.data_stream.read()

Do you have any ideas why pygit2 is slower here?

P.S. Would it be difficult to add streaming access?

@buhl
Contributor

buhl commented Apr 19, 2020

I would be willing to take a look at implementing something like streaming read support.
I have very little experience with C, CPython, and pygit2, so I think this could be a nice way to get my feet wet.

However,

@jnareb I know this is an old issue, but could you go into a little detail about how you ran these tests? Which repo, which files?

Also have you looked into using the memoryview feature?

pygit2/src/blob.c, lines 182 to 199 at 06c2fd5:

static int
Blob_getbuffer(Blob *self, Py_buffer *view, int flags)
{
    if (Object__load((Object*)self) == NULL) { return -1; } // Lazy load
    return PyBuffer_FillInfo(view, (PyObject *) self,
                             (void *) git_blob_rawcontent(self->blob),
                             git_blob_rawsize(self->blob), 1, flags);
}

static PyBufferProcs Blob_as_buffer = {
    (getbufferproc)Blob_getbuffer,
};
PyDoc_STRVAR(Blob__doc__, "Blob object.\n"
"\n"
"Blobs implement the buffer interface, which means you can get access\n"
"to its data via `memoryview(blob)` without the need to create a copy."
);
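For readers unfamiliar with the buffer protocol the docstring refers to, here is a minimal sketch; a `bytes` object stands in for a `Blob` so the snippet runs without a repository, and the pygit2 lines are shown only as comments (the path and revspec there are hypothetical):

```python
# Stand-in for a Blob: bytes also implements the buffer protocol.
data = b"blob contents " * 1000

view = memoryview(data)            # zero-copy view, analogous to memoryview(blob)
chunk = view[:4]                   # slicing is also zero-copy
assert chunk.tobytes() == b"blob"  # only tobytes() materializes a copy
chunk.release()
view.release()

# With a real repository the pattern is the same (hypothetical path/revspec):
# repo = pygit2.Repository("/path/to/repo")
# blob = repo.revparse_single("HEAD:README.md")
# view = memoryview(blob)          # no copy of the blob's raw content
```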

This is an old issue. Are we sure this is still a problem?

I did some tests on this repo with the biggest file in it.

In [1]: import pygit2
   ...: import os
   ...: r = pygit2.Repository("/home/user/repos/pygit2")

In [2]: %%timeit
    ...: b = memoryview(r.revparse_single("2b98170a:docs/_themes/sphinx_rtd_theme/static/css/fonts/fontawesome-webfont.svg"))
    ...: b.tobytes()
    ...: b.release()
    ...:
5.16 ms ± 49.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [3]: %%timeit
    ...: b = r.revparse_single("2b98170a:docs/_themes/sphinx_rtd_theme/static/css/fonts/fontawesome-webfont.svg")
    ...: b.data
10.2 ms ± 53.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [4]: %%timeit
    ...: b = memoryview(r.revparse_single("2b98170a:docs/_themes/sphinx_rtd_theme/static/css/fonts/fontawesome-webfont.svg"))
    ...: [_ for _ in b]
    ...: b.release()
16.4 ms ± 218 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [5]: %%timeit
     ...: b = r.revparse_single("2b98170a:docs/_themes/sphinx_rtd_theme/static/css/fonts/fontawesome-webfont.svg")
     ...: [_ for _ in b.data]
19 ms ± 403 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [6]: # When you loop over each char, memoryview comes closer to the normal access pattern

In [7]: # I suspect Python becomes the bottleneck here

In [8]: b = r.revparse_single("2b98170a:docs/_themes/sphinx_rtd_theme/static/css/fonts/fontawesome-webfont.svg")

In [9]: b.size
Out[9]: 444379
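The gap between In [3] and In [5] above is mostly Python-level iteration overhead rather than copying; a repository-free sketch (the size matches the blob benchmarked above, but is only illustrative):

```python
data = b"x" * 444_379

# Bulk access: one C-level copy of the whole buffer.
copied = memoryview(data).tobytes()
assert copied == data

# Per-item iteration: one Python loop step (yielding an int) per byte,
# which dominates the runtime regardless of where the bytes live.
count = sum(1 for _ in memoryview(data))
assert count == len(data)
```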

Edit:
Forgot to mention that you can use io.BytesIO on the blob object's data as well.
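To make the io.BytesIO tip concrete, a small sketch; note that BytesIO copies its initial bytes once, while slicing a memoryview stays zero-copy (the blob-based lines are hypothetical comments; a `bytes` stand-in keeps it runnable):

```python
import io

payload = b"0123456789" * 100        # stand-in for blob.data

# Stream-style access; BytesIO makes one up-front copy of its argument:
stream = io.BytesIO(payload)         # io.BytesIO(blob.data) in real code
assert stream.read(10) == b"0123456789"

# Zero-copy chunked iteration via memoryview slices:
view = memoryview(payload)           # memoryview(blob) in real code
chunks = [view[i:i + 256] for i in range(0, len(view), 256)]
assert sum(len(c) for c in chunks) == len(payload)
```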

@bors-ltd
Contributor

I'm digging up this old issue, but this nice feature isn't well documented. Sure, you can see it when doing help(pygit2.Blob), but if most people are like me, they read the online docs or use introspection, and may never notice it.

I thought the blob data was doomed to be loaded into memory, and I can store relatively big files in this repository (around 20 MB).

Just reusing that sentence:

Blobs implement the buffer interface, which means you can get access
to its data via `memoryview(blob)` without the need to create a copy.

with a short example maybe, would be enough. Would you like me to send a PR?
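One possible shape for such a docs example (the repository path and file name in the comments are hypothetical; the runnable part uses a `bytes` stand-in):

```python
# Hypothetical docs example:
# repo = pygit2.Repository("path/to/repo")
# blob = repo.revparse_single("HEAD:large-file.bin")
# view = memoryview(blob)       # zero-copy access to the blob's raw data
# header = view[:16].tobytes()  # only this slice is copied
# view.release()

# Runnable stand-in demonstrating the same calls on a buffer-protocol object:
blob_like = b"\x89PNG fake header" + b"\x00" * 1024
view = memoryview(blob_like)
header = view[:16].tobytes()
assert header == b"\x89PNG fake header"
view.release()
```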

As for the issue itself, maybe the original poster didn't know about the buffer interface either. Maybe it should be closed.

jdavid added a commit that referenced this issue Sep 23, 2020
@jdavid
Member

jdavid commented Sep 23, 2020

@bors-ltd Just made a commit to display this information in the docs. But didn't add an example (PRs welcome).

Closing this.

jdavid closed this as completed Sep 23, 2020