
Performance: speed of getting blob.data for large files (as compared to GitPython) #752

Closed
jnareb opened this issue Dec 2, 2017 · 3 comments


@jnareb

jnareb commented Dec 2, 2017

I compared the speed of the equivalent of git show <revision>:<pathname> in pygit2 and in GitPython (the pure-Python implementation). In every other case I have tested, pygit2 is faster, but for very large files the git show / git cat-file equivalent is slower.

pygit2 code:

blob = repo.revparse_single(commit + ':' + path)
result = blob.data

GitPython code:

blob = repo.rev_parse(commit + ':' + path)
result = blob.data_stream.read()

Do you have any ideas why pygit2 is slower here?

P.S. Would it be difficult to add streaming access?

@buhl
Contributor

buhl commented Apr 19, 2020

I would be willing to take a look at implementing something like streaming read support.
I have very little experience with C, CPython, and pygit2, so I think this could be a nice way to get my feet wet.

However,

@jnareb I know this is an old issue, but could you go into a little detail about how you ran these tests? Which repo, which files?

Also have you looked into using the memoryview feature?

pygit2/src/blob.c, lines 182 to 199 at 06c2fd5:

static int
Blob_getbuffer(Blob *self, Py_buffer *view, int flags)
{
    if (Object__load((Object*)self) == NULL) { return -1; } // Lazy load
    return PyBuffer_FillInfo(view, (PyObject *) self,
                             (void *) git_blob_rawcontent(self->blob),
                             git_blob_rawsize(self->blob), 1, flags);
}

static PyBufferProcs Blob_as_buffer = {
    (getbufferproc)Blob_getbuffer,
};
PyDoc_STRVAR(Blob__doc__, "Blob object.\n"
"\n"
"Blobs implement the buffer interface, which means you can get access\n"
"to its data via `memoryview(blob)` without the need to create a copy."
);
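For readers unfamiliar with the buffer protocol the docstring refers to, here is a minimal sketch; a `bytes` object stands in for a `Blob` so the snippet runs without a repository, and the pygit2 lines are shown only as comments (the path and revspec there are hypothetical):

```python
# Stand-in for a Blob: bytes also implements the buffer protocol.
data = b"blob contents " * 1000

view = memoryview(data)            # zero-copy view, analogous to memoryview(blob)
chunk = view[:4]                   # slicing is also zero-copy
assert chunk.tobytes() == b"blob"  # only tobytes() materializes a copy
chunk.release()
view.release()

# With a real repository the pattern is the same (hypothetical path/revspec):
# repo = pygit2.Repository("/path/to/repo")
# blob = repo.revparse_single("HEAD:README.md")
# view = memoryview(blob)          # no copy of the blob's raw content
```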

This is an old issue. Are we sure this is still a problem?

I did some tests on this repo with the biggest file in it.

In [1]: import pygit2
   ...: import os
   ...: r = pygit2.Repository("/home/user/repos/pygit2")

In [2]: %%timeit
    ...: b = memoryview(r.revparse_single("2b98170a:docs/_themes/sphinx_rtd_theme/static/css/fonts/fontawesome-webfont.svg"))
    ...: b.tobytes()
    ...: b.release()
    ...:
5.16 ms ± 49.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [3]: %%timeit
    ...: b = r.revparse_single("2b98170a:docs/_themes/sphinx_rtd_theme/static/css/fonts/fontawesome-webfont.svg")
    ...: b.data
10.2 ms ± 53.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [4]: %%timeit
    ...: b = memoryview(r.revparse_single("2b98170a:docs/_themes/sphinx_rtd_theme/static/css/fonts/fontawesome-webfont.svg"))
    ...: [_ for _ in b]
    ...: b.release()
16.4 ms ± 218 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [5]: %%timeit
     ...: b = r.revparse_single("2b98170a:docs/_themes/sphinx_rtd_theme/static/css/fonts/fontawesome-webfont.svg")
     ...: [_ for _ in b.data]
19 ms ± 403 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [6]: # When you loop over each char, memoryview comes closer to the normal access pattern

In [7]: # I suspect Python becomes the bottleneck here

In [8]: b = r.revparse_single("2b98170a:docs/_themes/sphinx_rtd_theme/static/css/fonts/fontawesome-webfont.svg")

In [9]: b.size
Out[9]: 444379
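The gap between In [3] and In [5] above is mostly Python-level iteration overhead rather than copying; a repository-free sketch (the size matches the blob benchmarked above, but is only illustrative):

```python
data = b"x" * 444_379

# Bulk access: one C-level copy of the whole buffer.
copied = memoryview(data).tobytes()
assert copied == data

# Per-item iteration: one Python loop step (yielding an int) per byte,
# which dominates the runtime regardless of where the bytes live.
count = sum(1 for _ in memoryview(data))
assert count == len(data)
```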

Edit:
Forgot to mention that you can use io.BytesIO on the blob object's data as well.
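To make the io.BytesIO tip concrete, a small sketch; note that BytesIO copies its initial bytes once, while slicing a memoryview stays zero-copy (the blob-based lines are hypothetical comments; a `bytes` stand-in keeps it runnable):

```python
import io

payload = b"0123456789" * 100        # stand-in for blob.data

# Stream-style access; BytesIO makes one up-front copy of its argument:
stream = io.BytesIO(payload)         # io.BytesIO(blob.data) in real code
assert stream.read(10) == b"0123456789"

# Zero-copy chunked iteration via memoryview slices:
view = memoryview(payload)           # memoryview(blob) in real code
chunks = [view[i:i + 256] for i in range(0, len(view), 256)]
assert sum(len(c) for c in chunks) == len(payload)
```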

@bors-ltd
Contributor

I'm digging up this old issue, but this nice feature isn't well documented. Sure, you can see it when doing help(pygit2.Blob), but if most people are like me, they read the online docs or use introspection, and may never notice it.

I thought the blob data was doomed to be loaded into memory, and I can store relatively big files in this repository (around 20 MB).

Just reusing that sentence:

Blobs implement the buffer interface, which means you can get access
to its data via `memoryview(blob)` without the need to create a copy.

with a short example maybe, would be enough. Would you like me to send a PR?
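One possible shape for such a docs example (the repository path and file name in the comments are hypothetical; the runnable part uses a `bytes` stand-in):

```python
# Hypothetical docs example:
# repo = pygit2.Repository("path/to/repo")
# blob = repo.revparse_single("HEAD:large-file.bin")
# view = memoryview(blob)       # zero-copy access to the blob's raw data
# header = view[:16].tobytes()  # only this slice is copied
# view.release()

# Runnable stand-in demonstrating the same calls on a buffer-protocol object:
blob_like = b"\x89PNG fake header" + b"\x00" * 1024
view = memoryview(blob_like)
header = view[:16].tobytes()
assert header == b"\x89PNG fake header"
view.release()
```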

As for the issue itself, maybe the original poster didn't know about the buffer interface either. Maybe it should be closed.

jdavid added a commit that referenced this issue Sep 23, 2020
@jdavid
Member

jdavid commented Sep 23, 2020

@bors-ltd Just made a commit to display this information in the docs. But didn't add an example (PRs welcome).

Closing this.

jdavid closed this as completed Sep 23, 2020