Try to ignore binary files and detect proper encoding #240

Merged (7 commits) on Sep 24, 2023

danipozo
Contributor

This is meant to solve #227. Tries to ignore binary files as well as detect encoding of non-UTF8 files using chardet. Can be complemented by this suggestion to fully solve the issue.

BTW this project is affected by the issue described in securisec/ripgrepy#16, might want to look into forking ripgrepy as it doesn't seem to be maintained.

danipozo changed the title from "Try to ignore binary files" to "Try to ignore binary files and detect proper encoding" on Sep 22, 2023
Owner

kantord commented Sep 22, 2023

> This is meant to solve #227. Tries to ignore binary files as well as detect encoding of non-UTF8 files using chardet. Can be complemented by this suggestion to fully solve the issue.
>
> BTW this project is affected by the issue described in securisec/ripgrepy#16, might want to look into forking ripgrepy as it doesn't seem to be maintained.

ok, we could fork ripgrepy, or perhaps it's possible for us to get rid of that dependency instead 🤔 I think ripgrep can directly generate JSON anyway.

seagoat/file.py Outdated
        if detector.done: break
    detector.close()
    if detector.result['confidence'] < 0.4:
        return 'bin'
Owner

hmm, so if the encoding cannot be detected, we assume that the file is binary? I can see that this is a safe option in terms of not breaking the server. However, if we have logic to make the server "resilient to errors", I would prefer to put it in some generic place rather than here.

Suggested change
-        return 'bin'
+        return 'utf-8'

Contributor Author

If you are going to handle encoding errors in some other way, then it is better to always return detector.result['encoding'], as it is more likely to be the correct encoding than utf-8.

Owner

you are right, that makes sense to me! and then, it would also simplify the code as checking for the confidence would not be necessary anymore, right?
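
A minimal sketch of where this thread seems to land, assuming the simplified logic returns chardet's guess without the confidence check. The names `detect_encoding`, `read_text` and `FALLBACK_ENCODING` are hypothetical and not taken from the PR. The import guard and the UTF-8 fallback for a missing guess are assumptions, added so the snippet still runs where chardet is not installed:

```python
from pathlib import Path

try:
    # chardet is a third-party package (pip install chardet).
    from chardet.universaldetector import UniversalDetector
except ImportError:
    # Assumption: without chardet, every file is treated as UTF-8.
    UniversalDetector = None

FALLBACK_ENCODING = "utf-8"  # hypothetical default, not taken from the PR


def detect_encoding(path, chunk_size=4096):
    """Return chardet's best guess for the encoding of the file at `path`."""
    if UniversalDetector is None:
        return FALLBACK_ENCODING
    detector = UniversalDetector()
    with open(path, "rb") as handle:
        # Feed the file chunk by chunk and stop early once chardet is done.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            detector.feed(chunk)
            if detector.done:
                break
    detector.close()
    # result["encoding"] is None when chardet has no guess (e.g. empty input).
    return detector.result["encoding"] or FALLBACK_ENCODING


def read_text(path):
    """Decode with the detected encoding, replacing undecodable bytes."""
    return Path(path).read_text(encoding=detect_encoding(path), errors="replace")
```

Passing `errors="replace"` is one possible way to keep decoding problems from propagating; whether that belongs here or in a more generic place is the open design question raised above.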

Owner

kantord commented Sep 22, 2023

I would also appreciate a test case that breaks without this fix. I think it would be enough to change any existing test case and force writing a file that is not the same encoding as the other files.

I think that you could add an encoding as an argument here, and make it default to "utf-8"

SeaGOAT/tests/conftest.py

Lines 130 to 151 in b7d8e08

    def add_file_change_commit(
        self,
        file_name,
        contents,
        author,
        commit_message,
    ):
        with open(
            os.path.join(self.working_dir, file_name), "w", encoding="utf-8"
        ) as output_file:
            output_file.write(contents)
        self.index.add([file_name])
        return self.index.commit(
            commit_message,
            author=author,
            committer=author,
            author_date=self.fake_commit_date,
            commit_date=self.fake_commit_date,
            skip_hooks=True,
        ).hexsha
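
The suggested `encoding` argument could look roughly like the sketch below. It isolates only the file-writing part of the fixture; the git index and commit calls are left out, and `write_file` is a hypothetical stand-in name, not the fixture's real method:

```python
import os
import tempfile


def write_file(working_dir, file_name, contents, encoding="utf-8"):
    """Write `contents` to working_dir/file_name using the given encoding."""
    path = os.path.join(working_dir, file_name)
    with open(path, "w", encoding=encoding) as output_file:
        output_file.write(contents)
    return path


with tempfile.TemporaryDirectory() as working_dir:
    # Existing callers keep the UTF-8 default.
    default_path = write_file(working_dir, "a.txt", "café")
    # A single test can opt in to a Windows encoding.
    windows_path = write_file(working_dir, "b.txt", "café", encoding="cp1252")
    with open(default_path, "rb") as raw_file:
        default_bytes = raw_file.read()
    with open(windows_path, "rb") as raw_file:
        windows_bytes = raw_file.read()
```

Because the default stays `"utf-8"`, existing calls would not need to change.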

then maybe just change one of these usages of that function to use a Windows encoding, and leave a comment saying that this is explicitly to make sure that it supports that encoding:

@pytest.mark.asyncio
async def test_considers_commit_messages(repo):
    repo.add_file_change_commit(
        file_name="vehicles_1.txt",
        contents="the the the",
        author=repo.actors["John Doe"],
        commit_message="pizza tomato salami recipe",
    )
    repo.add_file_change_commit(
        file_name="vehicles_2.txt",
        contents=".",
        author=repo.actors["John Doe"],
        commit_message="Add vehicle information",
    )
    repo.add_file_change_commit(
        file_name="vehicles_2.txt",
        contents="",
        author=repo.actors["John Doe"],
        commit_message="Add vehicle information",
    )
    seagoat = Engine(repo.working_dir)
    seagoat.analyze_codebase()
    my_query = "italian pomodoro pie with slices of cured meat"
    seagoat.query(my_query)
    await seagoat.fetch()
    assert seagoat.get_results()[0].path == "vehicles_1.txt"

Adding a completely separate test could also be a good alternative
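
To illustrate why such a test would break without the fix, here is a standalone sketch that does not use the SeaGOAT fixtures. `read_as_utf8` is a hypothetical stand-in for any code path that assumes UTF-8, and the file contents are made up for the example:

```python
import os
import tempfile


def read_as_utf8(path):
    """Naive reader that assumes every file is UTF-8."""
    with open(path, "r", encoding="utf-8") as source:
        return source.read()


def naive_reader_fails_on_cp1252():
    """Return True if a cp1252-encoded file makes the naive reader raise."""
    with tempfile.TemporaryDirectory() as working_dir:
        path = os.path.join(working_dir, "vehicles_1.txt")
        # Deliberately written as cp1252 to exercise non-UTF-8 handling.
        with open(path, "w", encoding="cp1252") as output_file:
            output_file.write("señor café")
        try:
            read_as_utf8(path)
        except UnicodeDecodeError:
            return True
        return False
```

The accented characters matter here: pure ASCII text is byte-identical in cp1252 and UTF-8, so it would not trigger the failure.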

Owner

kantord commented Sep 22, 2023

Also there seem to be some failures

@danipozo
Contributor Author

> ok, we could fork ripgrepy, or perhaps it's possible for us to get rid of that dependency instead 🤔 I think ripgrep can directly generate JSON anyway.

Yes, actually ripgrepy calls rg --json for this functionality and parses its output straightforwardly.
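
As a rough sketch of what dropping the dependency might involve: `rg --json` prints one JSON message per line, and only messages of type `match` carry matched lines. The function name `parse_rg_json` is hypothetical, and the sample output below is abbreviated, illustrative data rather than a real ripgrep run:

```python
import json


def parse_rg_json(output):
    """Yield (path, line_number, line_text) for each match in `rg --json` output."""
    for raw_line in output.splitlines():
        if not raw_line.strip():
            continue
        message = json.loads(raw_line)
        if message.get("type") != "match":
            continue  # skip other message types such as "begin", "end", "summary"
        data = message["data"]
        # ripgrep uses {"text": ...} for valid UTF-8 and {"bytes": ...} (base64) otherwise.
        path = data["path"].get("text")
        text = data["lines"].get("text")
        if path is None or text is None:
            continue  # this sketch ignores non-UTF-8 paths and lines
        yield path, data["line_number"], text.rstrip("\n")


# Abbreviated, made-up sample in the shape of ripgrep's JSON Lines output.
SAMPLE_OUTPUT = "\n".join(
    json.dumps(message)
    for message in [
        {"type": "begin", "data": {"path": {"text": "seagoat/file.py"}}},
        {
            "type": "match",
            "data": {
                "path": {"text": "seagoat/file.py"},
                "lines": {"text": "    return 'bin'\n"},
                "line_number": 12,
                "absolute_offset": 200,
                "submatches": [{"match": {"text": "bin"}, "start": 12, "end": 15}],
            },
        },
        {"type": "end", "data": {"path": {"text": "seagoat/file.py"}}},
    ]
)

matches = list(parse_rg_json(SAMPLE_OUTPUT))
```

Real input could come from running `rg --json` through `subprocess.run` with `capture_output=True`. Note that ripgrep exits with status 1 when nothing matches, so a non-zero exit code is not necessarily an error.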

Owner

kantord commented Sep 24, 2023

I extracted this logic to a class because it's also used in Result. I added a test and it was still failing for the same reason in result.

kantord merged commit 3b889bc into kantord:main on Sep 24, 2023
5 checks passed
Contributor Author

danipozo commented Sep 24, 2023

That was fast, thanks!

2 participants