Libgit2 binary detection does not match core git #2176
Comments
It looks like libgit2 does fancy stuff with printable/non-printable characters. Core git just says "is there a NUL in the first 8K?". It works surprisingly well.
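For readers who want to see the simple check spelled out, here is a minimal Python sketch of the "NUL in the first 8K" heuristic described in this comment. The function name and the 8000-byte window are assumptions made for illustration (the comment only says "8K"); this is not core git's actual code.

```python
# Assumed window size; the comment above only says "8K".
SNIFF_BYTES = 8000


def looks_binary_nul(data: bytes, window: int = SNIFF_BYTES) -> bool:
    """Return True if a NUL byte appears in the first `window` bytes."""
    return b"\x00" in data[:window]
```

A NUL that sits past the window is not seen, which is exactly the limitation discussed later in this thread.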
Libgit2 has two modes that it uses, both borrowed from core git. One is the fancy detection and one is just to look for NULs. Libgit2 used to use the fancy mode all the time, but I made it use the simple NUL byte check for diffs in order to match core git behavior. This just looks like a regression.
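To make the difference between the two modes concrete, here is a rough Python sketch of what a "fancy" printable/non-printable check could look like. It is only loosely modeled on the idea described above: the set of control characters treated as text and the 1-in-128 threshold are assumptions for illustration, not a copy of libgit2's or core git's implementation.

```python
# Control characters this sketch treats as ordinary text (an assumption):
# tab, LF, CR, backspace, form feed, escape.
TEXT_CONTROLS = frozenset(b"\t\n\r\b\x0c\x1b")


def looks_binary_ratio(data: bytes) -> bool:
    """Guess binary-ness from the mix of printable and non-printable bytes."""
    printable = 0
    nonprintable = 0
    for byte in data:
        if byte == 0:
            # Any NUL byte is treated as conclusive.
            return True
        if byte == 127 or (byte < 32 and byte not in TEXT_CONTROLS):
            nonprintable += 1
        else:
            printable += 1
    # Assumed threshold: binary if more than roughly 1 byte in 128
    # is non-printable.
    return (printable >> 7) < nonprintable
```

Unlike the plain NUL check, a sketch like this scans the whole buffer and can flag a file that contains stray control bytes but no NUL, which is one way the two modes can disagree on the same file.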
@peff I seem to remember that git core does not look for PDF headers during binary detection. Is this a conscious decision or just something that's never been added? I feel like this is something that libgit2 (really, every function looking for binaryness) should do, because PDFs often elude binary detection: no matter how many bytes you look at, a PDF can be large enough that the embedded fonts / images / whatever binary payload lies past your detection window.
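The PDF header check proposed here could be sketched as follows in Python. This is a hypothetical illustration, not something core git or libgit2 does: the function names are invented, the 8000-byte window is an assumption, and the only fact relied on is that PDF files begin with the `%PDF-` marker.

```python
# PDF files start with this marker.
PDF_MAGIC = b"%PDF-"


def looks_like_pdf(data: bytes) -> bool:
    """Return True if the buffer starts with the PDF header marker."""
    return data.startswith(PDF_MAGIC)


def looks_binary_with_pdf_check(data: bytes, window: int = 8000) -> bool:
    """Hypothetical combined check: PDF header first, then NUL-in-window."""
    return looks_like_pdf(data) or b"\x00" in data[:window]
```

With this, a PDF whose first binary payload only appears after the window would still be reported as binary, while ordinary text is unaffected.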
No, we don't do any file-type detection at all. Certainly the NUL-in-8K thing can be totally wrong; you can have a bunch of text followed by a bunch of binary crap. But I don't think it is a complaint we have ever seen on the list, so if it happens, I suspect it is relatively rare (and if you have a particular problem case, you can use gitattributes to override the auto-detection).
By the way, I've been experimenting recently with using libicu in git-core to more accurately detect actual text for diffs (and do things like normalize encodings before comparing two pieces of text). It turns out that it's rather slow to analyze the whole file (like an order of magnitude more than doing the actual diff). "Cheating" by peeking at the headers would help with the performance, but covering every case would be hard (I guess you'd want to delegate to something like libmagic).
This should be fixed with #2362. Can you confirm, @arthurschreiber?
I'll have to check.
@arthurschreiber I'm going to close this unless you think that there might still be some issues here.
This is something I found while playing around with Rugged. With testrepo.git:
cp -rf <path-to-libgit2>/tests/resources/testrepo.git/ .
git commit -am "test"
git show master
This will correctly list all binary files (like the *.idx and the other git object files) as binary.
When using Rugged to (roughly) do the same, some files are recognized as binary, while others are not.
//cc @arrbee Can you take a stab at this?