Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Libgit2 binary detection does not match core git #2176

Closed
arthurschreiber opened this issue Mar 9, 2014 · 7 comments
Closed

Libgit2 binary detection does not match core git #2176

arthurschreiber opened this issue Mar 9, 2014 · 7 comments

Comments

@arthurschreiber
Copy link
Member

This is something I found while playing around with Rugged:

  • Create a new repository.
  • Copy over the files from testrepo.git: cp -rf <path-to-libgit2>/tests/resources/testrepo.git/ .
  • Add and commit all files git commit -am "test"
  • Check the diff: git show master

This will correctly list all binary files (like the *.idx and the other git object files) as binary:

...
Binary files /dev/null and b/objects/08/b041783f40edfe12bb406c9c9a8a040177c125 differ
diff --git a/objects/13/85f264afb75a56a5bec74243be9b367ba4ca08 b/objects/13/85f264afb75a56a5bec74243be9b367ba4ca08
new file mode 100644
index 0000000..cedb2a2
Binary files /dev/null and b/objects/13/85f264afb75a56a5bec74243be9b367ba4ca08 differ
diff --git a/objects/18/1037049a54a1eb5fab404658a3a250b44335d7 b/objects/18/1037049a54a1eb5fab404658a3a250b44335d7
new file mode 100644
index 0000000..93a16f1
Binary files /dev/null and b/objects/18/1037049a54a1eb5fab404658a3a250b44335d7 differ
diff --git a/objects/18/10dff58d8a660512d4832e740f692884338ccd b/objects/18/10dff58d8a660512d4832e740f692884338ccd
new file mode 100644
index 0000000..ba0bfb3
...

When using rugged to (roughly) do the same:

require "rugged"

repo = Rugged::Repository.new(".")
diff = repo.branches["master"].target.tree.diff("4b825dc642cb6eb9a060e54bf8d69288fbee4904", reverse: true)

diff.deltas.each do |delta|
  puts "#{delta.old_file[:path]} -> #{delta.binary?}"
end

diff.each_line do |line|
  puts line.content
end

Some files are recognized as binary, while others are not:

--- /dev/null
+++ a/objects/b6/361fc6a97178d8fc8639fdeed71c775ab52593
diff --git b/objects/be/3563ae3f795b2b4353bcce3a527ad0a4f7f644 a/objects/be/3563ae3f795b2b4353bcce3a527ad0a4f7f644
new file mode 100644
index 0000000..0817229
--- /dev/null
+++ a/objects/be/3563ae3f795b2b4353bcce3a527ad0a4f7f644
@@ -0,0 +1,3 @@
x??Kj1D??)z?UB??-0??uV9????<#??????+?W<J???&8?/s??e???ȕKJ????S
?Rv??{??Q??r??Y?QN$H\E????=6?X5???K Fr)?(?dC??Ά?????j?s?}???9?c-?w8?o?\?r??I???:
l}F?W$Ds?ǣ??ٚOW?e?]V8-Ý??"U

\ No newline at end of file
diff --git b/objects/c4/7800c7266a2be04c571c04d5a6614691ea99bd a/objects/c4/7800c7266a2be04c571c04d5a6614691ea99bd
new file mode 100644
index 0000000..75f541f
--- /dev/null
+++ a/objects/c4/7800c7266a2be04c571c04d5a6614691ea99bd
@@ -0,0 +1,3 @@
x??Q
?0D??)?ʦ?I<?'?lR+?Fj??Eo??0<x?h???a ??]ș??XUl?PF)?z?4y?,\r    'S??-mI4
?Xh??&??F?}n+\???Y?-p|鷜oU?z;-??a??lt{????I?,:?o?R??cHK

\ No newline at end of file
diff --git b/objects/d0/7b0f9a8c89f1d9e74dc4fce6421dec5ef8a659 a/objects/d0/7b0f9a8c89f1d9e74dc4fce6421dec5ef8a659
new file mode 100644
index 0000000..f3b46b3
--- /dev/null
+++ a/objects/d0/7b0f9a8c89f1d9e74dc4fce6421dec5ef8a659
diff --git b/objects/d6/c93164c249c8000205dd4ec5cbca1b516d487f a/objects/d6/c93164c249c8000205dd4ec5cbca1b516d487f
new file mode 100644
index 0000000..a67d6e6

//cc @arrbee Can you take a stab at this?

@peff
Copy link
Member

peff commented Mar 10, 2014

It looks like libgit2 does fancy stuff with printable/non-printable characters. Core git just says "is there a NUL in the first 8K?". It works surprisingly well.

@arrbee
Copy link
Member

arrbee commented Mar 10, 2014

Libgit2 has two modes that it uses, both borrowed from core git. One is the fancy detection and one is just to look for nulls. Libgit2 used to use the fancy mode all the time but I made it use the simple NUL byte check for diffs in order to match core git behavior. This just looks like a regression.

@ethomson
Copy link
Member

@peff I seem to remember that git core does not look for PDF headers during binary detection. Is this a conscious decision or just something that's never been added?

I feel like this is something that libgit2 (really, every function looking for binaryness) should do, because PDF's often elude binary detection since no matter how many bytes you look at, you can have a PDF that is large enough that the embedded fonts / images / whatever binary payload is past your detection.

@peff
Copy link
Member

peff commented Mar 10, 2014

No, we don't do any file-type detection at all. Certainly the NUL-in-8K thing can be totally wrong; you can have a bunch of text followed by a bunch of binary crap. But I don't think it is a complaint we have ever seen on the list, so if it happens, I suspect it is relatively rare (and if you have a particular problem case, you can use gitattributes to override the auto-detection).

By the way, I've been experimenting recently with using libicu in git-core to more accurately detect actual text for diffs (and do things like normalize encodings before comparing two pieces of text). It turns out that it's rather slow to analyze the whole file (like an order of magnitude more than doing the actual diff). "Cheating" by peeking at the headers would help with the performance, but covering every case would be hard (I guess you'd want to delegate to something like libmagic).

@carlosmn
Copy link
Member

This should be fixed with #2362, can you confirm @arthurschreiber ?

@arthurschreiber
Copy link
Member Author

I'll have to check.

@ethomson
Copy link
Member

@arthurschreiber I'm going to close this unless you think that there might still be some issues here.

@ethomson ethomson closed this as completed Nov 9, 2014
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants