Try to ignore binary files and detect proper encoding #240

danipozo · 2023-09-22T16:27:34Z

This is meant to solve #227. Tries to ignore binary files as well as detect encoding of non-UTF8 files using chardet. Can be complemented by this suggestion to fully solve the issue.

BTW this project is affected by the issue described in securisec/ripgrepy#16, might want to look into forking ripgrepy as it doesn't seem to be maintained.

kantord · 2023-09-22T17:30:51Z

> BTW this project is affected by the issue described in securisec/ripgrepy#16, might want to look into forking ripgrepy as it doesn't seem to be maintained.

ok, we could fork ripgrepy, or perhaps it's possible for us to get rid of that dependency instead 🤔 I think ripgrep can directly generate JSON anyway.

kantord · 2023-09-22T17:32:47Z

seagoat/file.py

+            if detector.done: break
+        detector.close()
+        if detector.result['confidence'] < 0.4:
+            return 'bin'


hmm, so if the encoding cannot be detected, we assume that it is binary? I can see that this is a safe option in terms of not breaking the server, however if we have logic to make the server "resilient to errors" I would prefer to not do it here but in some generic place

Suggested change

-            return 'bin'
+            return 'utf-8'

danipozo · 2023-09-22

If you are going to handle encoding errors in some other way, then it is better to always return detector.result['encoding'], as it is likelier to be the correct encoding than utf-8.

kantord · 2023-09-22

You are right, that makes sense to me! And then it would also simplify the code, as checking the confidence would not be necessary anymore, right?
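A minimal sketch of the simplification discussed in this thread: drop the confidence threshold and always return whatever chardet detects. The function name `detect_encoding` and the `default` and `chunk_size` parameters are hypothetical and not taken from the PR; the fallback when chardet is missing or reports no encoding is also an assumption.

```python
def detect_encoding(path, default="utf-8", chunk_size=4096):
    """Return chardet's best guess for the file's encoding, or `default`."""
    try:
        # chardet is a third-party dependency (pip install chardet).
        from chardet.universaldetector import UniversalDetector
    except ImportError:
        # Assumption: without chardet we simply keep the old default.
        return default

    detector = UniversalDetector()
    with open(path, "rb") as handle:
        # Feed the file in chunks and stop as soon as chardet has decided.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            detector.feed(chunk)
            if detector.done:
                break
    detector.close()

    # No confidence check: prefer the detected encoding and fall back only
    # when chardet reports nothing at all (for example, an empty file).
    return detector.result["encoding"] or default
```

Any decoding errors that remain after detection would then be handled in one generic place, as suggested above, rather than inside this helper.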

kantord · 2023-09-22T17:40:24Z

I would also appreciate a test case that breaks without this fix. I think it would be enough to change any existing test case and force writing a file that is not the same encoding as the other files.

I think that you could add an encoding as an argument here, and make it default to "utf-8"

SeaGOAT/tests/conftest.py

Lines 130 to 151 in b7d8e08

    
    def add_file_change_commit(
        self,
        file_name,
        contents,
        author,
        commit_message,
    ):
        with open(
            os.path.join(self.working_dir, file_name), "w", encoding="utf-8"
        ) as output_file:
            output_file.write(contents)
        self.index.add([file_name])
        return self.index.commit(
            commit_message,
            author=author,
            committer=author,
            author_date=self.fake_commit_date,
            commit_date=self.fake_commit_date,
            skip_hooks=True,
        ).hexsha
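The file-writing part of that fixture with an `encoding` argument could look roughly like the sketch below. `write_file` is a hypothetical standalone helper used only for illustration; the real fixture method also stages and commits the file, which is left out here.

```python
import os


def write_file(working_dir, file_name, contents, encoding="utf-8"):
    """Write `contents` to `file_name` using the requested encoding."""
    path = os.path.join(working_dir, file_name)
    # The encoding defaults to UTF-8, so existing callers keep their behavior.
    with open(path, "w", encoding=encoding) as output_file:
        output_file.write(contents)
    return path
```

Existing tests would keep calling it without the extra argument, and a single test could pass something like `encoding="cp1252"`.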

then maybe just change one of these usages of that function to use a Windows encoding and leave a comment that this is explicitly to make sure that it supports that encoding:

SeaGOAT/tests/test_chroma.py

Lines 116 to 144 in b7d8e08

    
    @pytest.mark.asyncio
    async def test_considers_commit_messages(repo):
        repo.add_file_change_commit(
            file_name="vehicles_1.txt",
            contents="the the the",
            author=repo.actors["John Doe"],
            commit_message="pizza tomato salami recipe",
        )
        repo.add_file_change_commit(
            file_name="vehicles_2.txt",
            contents=".",
            author=repo.actors["John Doe"],
            commit_message="Add vehicle information",
        )
        repo.add_file_change_commit(
            file_name="vehicles_2.txt",
            contents="",
            author=repo.actors["John Doe"],
            commit_message="Add vehicle information",
        )
        seagoat = Engine(repo.working_dir)
        seagoat.analyze_codebase()
        my_query = "italian pomodoro pie with slices of cured meat"
        seagoat.query(my_query)
        await seagoat.fetch()
        assert seagoat.get_results()[0].path == "vehicles_1.txt"

Adding a completely separate test could also be a good alternative
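For illustration, here is a standalone sketch of why such a test fails without the fix: a file written in a Windows code page cannot be decoded with a hard-coded UTF-8 reader. It uses only the standard library rather than the project's `repo` fixture or `Engine`, and the names `read_text` and `test_non_utf8_file_needs_its_own_encoding` are hypothetical.

```python
import os
import tempfile


def read_text(path, encoding):
    """Read a whole file as text using the given encoding."""
    with open(path, "r", encoding=encoding) as handle:
        return handle.read()


def test_non_utf8_file_needs_its_own_encoding():
    with tempfile.TemporaryDirectory() as working_dir:
        path = os.path.join(working_dir, "menu.txt")
        # Explicitly use a Windows code page instead of UTF-8.
        with open(path, "w", encoding="cp1252") as handle:
            handle.write("crème brûlée")

        # A reader that assumes UTF-8 cannot decode the accented bytes.
        try:
            read_text(path, "utf-8")
        except UnicodeDecodeError:
            failed = True
        else:
            failed = False
        assert failed

        # Reading with the encoding the file was written in works.
        assert read_text(path, "cp1252") == "crème brûlée"
```

A real test in this repository would go through the fixture and the engine instead, so that it exercises the encoding detection itself.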

kantord · 2023-09-22T17:40:54Z

Also there seem to be some failures

danipozo · 2023-09-22T17:56:48Z

> ok, we could fork ripgrepy, or perhaps it's possible for us to get rid of that dependency instead 🤔 I think ripgrep can directly generate JSON anyway.

Yes, actually ripgrepy calls rg --json for this functionality and parses its output straightforwardly.
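As a rough sketch of what dropping the dependency could involve, the snippet below calls `rg --json` directly and parses its JSON Lines output. The function names `parse_rg_json` and `search` are hypothetical, and this is not how ripgrepy or this project implements it. It relies on ripgrep emitting `{"text": ...}` for valid UTF-8 and `{"bytes": <base64>}` otherwise, which matters for the non-UTF-8 files this PR is about.

```python
import base64
import json
import subprocess


def _decode(field):
    """Return the text of a ripgrep JSON field, decoding base64 if needed."""
    if "text" in field:
        return field["text"]
    # Non-UTF-8 content arrives base64-encoded under the "bytes" key.
    return base64.b64decode(field["bytes"]).decode("utf-8", errors="replace")


def parse_rg_json(output):
    """Yield (path, line_number, line_text) for every "match" message."""
    for raw_line in output.splitlines():
        if not raw_line.strip():
            continue
        message = json.loads(raw_line)
        # Other message types ("begin", "end", "summary", ...) are skipped.
        if message.get("type") != "match":
            continue
        data = message["data"]
        yield _decode(data["path"]), data["line_number"], _decode(data["lines"])


def search(pattern, directory):
    """Run ripgrep (must be on PATH) and return the parsed matches."""
    completed = subprocess.run(
        ["rg", "--json", "--", pattern, directory],
        capture_output=True,
        encoding="utf-8",
        check=False,  # rg exits with status 1 when nothing matches
    )
    return list(parse_rg_json(completed.stdout))
```

Keeping the parser separate from the subprocess call makes it testable with hand-written sample output, without needing ripgrep installed.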

kantord · 2023-09-24T09:41:09Z

I extracted this logic to a class because it's also used in Result. I added a test, and it was still failing for the same reason in Result.

danipozo · 2023-09-24T10:03:13Z

That was fast, thanks!

danipozo added 2 commits September 22, 2023 18:23

Try to ignore binary files

54016cb

Fix typo in README

316e86c

danipozo changed the title ~~Try to ignore binary files~~ Try to ignore binary files and detect proper encoding Sep 22, 2023

kantord reviewed Sep 22, 2023

kantord added 5 commits September 23, 2023 18:14

Merge branch 'main' into ignore-binary-files

225b742

fix: always detect a file encoding

af81c59

test: test that other encodings are supported

bdfb0fd

add FileReader

8dcaa9b

docs: document list of supported character encodings

737a06c

kantord merged commit 3b889bc into kantord:main Sep 24, 2023
5 checks passed

kantord mentioned this pull request Sep 24, 2023

Support encodings other than utf-8 #227

Closed

cori mentioned this pull request Sep 25, 2023

"UnicodeDecodeError: 'charmap' codec can't decode byte 0x81" #250

Closed
