-
Notifications
You must be signed in to change notification settings - Fork 65
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Try to ignore binary files and detect proper encoding #240
Conversation
ok, we could fork ripgrepy, or perhaps it's possible for us to get rid of that dependency instead 🤔 I think ripgrep can directly generate JSON anyway. |
seagoat/file.py
Outdated
if detector.done: break | ||
detector.close() | ||
if detector.result['confidence'] < 0.4: | ||
return 'bin' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm, so if the encoding cannot be detected, we assume that it is binary? I can see that this is a safe option in terms of not breaking the server, however if we have logic to make the server "resilient to errors" I would prefer to not do it here but in some generic place
return 'bin' | |
return 'utf-8' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you are going to handle encoding errors in some other way, then better to always return detector.result['encoding']
, as it is likelier to be the correct encoding than utf-8.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you are right, that makes sense to me! and then, it would also simplify the code as checking for the confidence would not be necessary anymore, right?
I would also appreciate a test case that breaks without this fix. I think it would be enough to change any existing test case and force writing a file that is not the same encoding as the other files. I think that you could add an Lines 130 to 151 in b7d8e08
then maybe just change one of these usages of that function to use a windows encoding and leave a comment that this is explicitly to make sure that it support that encoding: Lines 116 to 144 in b7d8e08
Adding a completely separate test could also be a good alternative |
Also there seem to be some failures |
Yes, actually ripgrepy calls |
I extracted this logic to a class because it's also used in Result. I added a test and it was still failing for the same reason in result. |
That was fast, thanks! |
This is meant to solve #227. Tries to ignore binary files as well as detect encoding of non-UTF8 files using
chardet
. Can be complemented by this suggestion to fully solve the issue.BTW this project is affected by the issue described in securisec/ripgrepy#16, might want to look into forking
ripgrepy
as it doesn't seem to be maintained.