Enforce UTF-8 encoding #1381

Merged
merged 5 commits into from
Feb 12, 2024

Conversation

victorlin
Member

@victorlin victorlin commented Dec 27, 2023

Description of proposed changes

Enforce UTF-8 encoding when reading and writing files. Improve error messages when a non-UTF-8 file is used.

See commit messages for details.

Related issue(s)

Checklist

  • Checks pass
  • If making user-facing changes, add a message in CHANGES.md summarizing the changes in this PR

@victorlin victorlin self-assigned this Dec 27, 2023

codecov bot commented Dec 27, 2023

Codecov Report

Attention: 10 lines in your changes are missing coverage. Please review.

Comparison is base (dd8a1cb) 67.55% compared to head (5762baa) 67.67%.

| Files | Patch % | Lines |
|---|---|---|
| augur/sequence_traits.py | 33.33% | 2 Missing ⚠️ |
| augur/export_v2.py | 50.00% | 1 Missing ⚠️ |
| augur/import_/beast.py | 50.00% | 1 Missing ⚠️ |
| augur/lbi.py | 50.00% | 1 Missing ⚠️ |
| augur/measurements/export.py | 50.00% | 1 Missing ⚠️ |
| augur/reconstruct_sequences.py | 50.00% | 1 Missing ⚠️ |
| augur/traits.py | 66.66% | 1 Missing ⚠️ |
| augur/tree.py | 75.00% | 1 Missing ⚠️ |
| augur/utils.py | 83.33% | 1 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1381      +/-   ##
==========================================
+ Coverage   67.55%   67.67%   +0.12%     
==========================================
  Files          69       69              
  Lines        7493     7518      +25     
  Branches     1844     1844              
==========================================
+ Hits         5062     5088      +26     
+ Misses       2159     2158       -1     
  Partials      272      272              


@victorlin
Member Author

@tsibley this needs a bit more work but I'd like to get your opinion on the direction.

Member

@tsibley tsibley left a comment

This seems like a small but reasonable start!

Being explicit about encodings is best. Once Python's UTF-8 mode is the default (not for a while) we could be less explicit, but it'd really still be better to declare "Augur is UTF-8" explicitly.

It might be helpful to enable the default encoding warning in conjunction with the test suite to help us find places that rely on the implicit default of the current locale's encoding.

Are there bits in particular you wanted feedback on?

augur/io/metadata.py Outdated
@victorlin
Member Author

@tsibley no, that's helpful! Just wanted to see if you also thought declaring "Augur is UTF-8" is a good idea. I'll update this PR with your suggestions.

@victorlin
Member Author

victorlin commented Jan 24, 2024

> It might be helpful to enable the default encoding warning in conjunction with the test suite to help us find places that rely on the implicit default of the current locale's encoding.

I added it to CI in 1b52deb and triggered a manual run to expose warnings. It's very noisy since much external code doesn't set encoding explicitly, but nonetheless there is still some Augur code that is flagged. Example:

site-packages/Bio/File.py:72: EncodingWarning: 'encoding' argument not specified
augur/filter/_run.py:140: EncodingWarning: 'encoding' argument not specified

I think the most I'll do here is use the CI run as reference and specify encoding explicitly for all warnings that arise from within the Augur codebase.
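
For readers who want to reproduce this kind of check locally, here is a minimal sketch. It relies on the `-X warn_default_encoding` interpreter option and the `EncodingWarning` category, both available from Python 3.10. The helper name `run_with_encoding_warnings` and the temporary file are made up for illustration and are not part of Augur or its CI configuration.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_with_encoding_warnings(snippet):
    """Run `snippet` in a child interpreter that turns EncodingWarning into an error.

    Hypothetical helper for illustration only.
    """
    return subprocess.run(
        [
            sys.executable,
            "-X", "warn_default_encoding",   # emit EncodingWarning for implicit encodings
            "-W", "error::EncodingWarning",  # escalate that warning to an exception
            "-c", snippet,
        ],
        capture_output=True,
        encoding="utf-8",  # be explicit here too
    )


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "example.txt"
    path.write_text("hello\n", encoding="utf-8")

    # open() without an encoding argument relies on the locale default
    implicit = run_with_encoding_warnings(f"open({str(path)!r}).close()")
    # open() with an explicit encoding should pass cleanly
    explicit = run_with_encoding_warnings(f"open({str(path)!r}, encoding='utf-8').close()")

print("implicit open exit code:", implicit.returncode)
print("explicit open exit code:", explicit.returncode)
```

Setting the `PYTHONWARNDEFAULTENCODING=1` environment variable is the equivalent switch when the interpreter command line cannot be changed, for example when running a test suite through a wrapper.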

@victorlin victorlin force-pushed the victorlin/io-encoding branch 2 times, most recently from be00ef3 to 9325ca9 Compare January 25, 2024 18:29
@victorlin victorlin changed the title Improve file encoding errors Enforce UTF-8 encoding Jan 25, 2024
@victorlin victorlin marked this pull request as ready for review January 25, 2024 18:39
@victorlin victorlin requested a review from a team January 25, 2024 18:39
@tsibley tsibley self-requested a review February 7, 2024 19:39
Member

@tsibley tsibley left a comment

From a13f3ef's message:

> Maybe a similar central function should be used for pandas.read_csv().

For common/shared/required arguments to pandas.read_csv(), yeah. But if just for encoding/compression, we can pass read_csv() a filehandle returned by open_file() instead of a filename, right?

augur/io/file.py Outdated
augur/io/file.py
tests/functional/filter/cram/filter-file-encoding-error.t Outdated
augur/filter/_run.py Outdated
CHANGES.md Outdated
@victorlin
Member Author

victorlin commented Feb 9, 2024

> For common/shared/required arguments to pandas.read_csv(), yeah. But if just for encoding/compression, we can pass read_csv() a filehandle returned by open_file() instead of a filename, right?

Good point. Will look into this.

EDIT: Implemented as f0d1177

@tsibley
Member

tsibley commented Feb 10, 2024

> Good point. Will look into this.
>
> EDIT: Implemented as f0d1177

Ah, what I meant was turning:

pandas.read_csv(args.filename, ...)

into:

pandas.read_csv(open_file(args.filename), ...)

or the conceptual equivalent using a with statement. That's a little bit easier if open_file() stops being a context manager itself and behaves more like open().

By opening the file for Pandas, we standardize the compression and encoding support for Augur.
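
The idea of opening the file centrally and handing the parser an already-decoded handle can be sketched with the standard library alone. In the sketch below, `open_text` and `DEFAULT_ENCODING` are hypothetical stand-ins (not Augur's actual `open_file()`), gzip is the only compression handled, and `csv.DictReader` stands in for `pandas.read_csv()`, which also accepts an open text handle as its first argument.

```python
import csv
import gzip
import tempfile
from pathlib import Path

DEFAULT_ENCODING = "utf-8"  # assumption: one project-wide default encoding


def open_text(path, mode="r"):
    """Hypothetical central opener: picks gzip by file suffix and always uses UTF-8."""
    path = Path(path)
    opener = gzip.open if path.suffix == ".gz" else open
    # The "t" suffix makes gzip.open return a text-mode handle, like built-in open()
    return opener(path, mode + "t", encoding=DEFAULT_ENCODING, newline="")


with tempfile.TemporaryDirectory() as tmp:
    plain = Path(tmp) / "metadata.tsv"
    zipped = Path(tmp) / "metadata.tsv.gz"

    # Write the same content through the central opener, once plain and once gzipped
    for target in (plain, zipped):
        with open_text(target, "w") as handle:
            handle.write("strain\tcountry\nA/1\tCôte d'Ivoire\n")

    # The parser never sees a filename, so it cannot choose its own encoding or codec
    rows = []
    for source in (plain, zipped):
        with open_text(source) as handle:
            rows.append(list(csv.DictReader(handle, delimiter="\t")))

print(rows[0] == rows[1])
```

Because the parser only receives a handle, the choice of compression format and encoding lives in one place instead of being repeated at each call site.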

augur/io/metadata.py Outdated
This is the default used by xopen as of v1.3.0.¹ Make it explicit here.

Also, since this module is used for IO across the codebase, store the
default encoding of UTF-8 in a variable to be used in other parts of the
codebase in future commits.

¹ <https://github.com/pycompression/xopen/blob/v1.3.0/README.rst#changes>
This is a common user error, so handle it explicitly with a helpful error
message.
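
As a rough illustration of that kind of handling (not the code from this PR), the sketch below catches `UnicodeDecodeError` and re-raises it with a message that names the file. `FileEncodingError` and `read_utf8_text` are hypothetical names chosen for this example.

```python
import tempfile
from pathlib import Path


class FileEncodingError(Exception):
    """Hypothetical error type for this sketch (not an Augur class)."""


def read_utf8_text(path):
    """Read a whole text file as UTF-8, translating decode failures into a clearer error."""
    try:
        with open(path, encoding="utf-8") as handle:
            return handle.read()
    except UnicodeDecodeError as error:
        raise FileEncodingError(
            f"File {str(path)!r} is not valid UTF-8. "
            "Try re-saving it with UTF-8 encoding."
        ) from error


with tempfile.TemporaryDirectory() as tmp:
    good = Path(tmp) / "good.tsv"
    bad = Path(tmp) / "bad.tsv"
    good.write_bytes("name\ncafé\n".encode("utf-8"))
    bad.write_bytes("name\ncafé\n".encode("latin-1"))  # 0xE9 alone is invalid UTF-8

    good_text = read_utf8_text(good)
    try:
        read_utf8_text(bad)
        message = None
    except FileEncodingError as error:
        message = str(error)

print(message)
```

Chaining with `from error` keeps the original decode failure available in the traceback for debugging while the user sees the shorter message.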
This allows setting defaults and supporting various compression formats
all from one central function.

Maybe a similar central function should be used for pandas.read_csv().
Start with just an explicit encoding. Applied to all invocations of
read_csv().

The current motivation of this change is to enforce UTF-8 encoding.
Another way to address that motivation would be to use Augur's internal
open_file() in place of the first argument to read_csv() to handle
encoding options as well as supported compression formats - but that's
not a trivial change.
@victorlin
Member Author

> Ah, what I meant was...

Oh whoops, I registered that at first but forgot to write down the change in direction after I tried it out.

Replacing the first argument to read_csv() across the board isn't a simple change, especially for read_metadata(), which opens the file multiple times and potentially returns a DataFrame chunk iterator.

Instead of trying to make that work, I thought it would be easiest to enforce UTF-8 through shared options.

I've summarized this in the commit message of a4601a2 (previously f0d1177).
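
One general reason a single open handle is awkward for code that reads its input more than once can be shown with a small stdlib-only sketch. This is an illustration under assumed names (`two_pass_read`, `TEXT`), not a description of how read_metadata() is implemented.

```python
import io

TEXT = "strain\tdate\nA/1\t2024-01-01\nA/2\t2024-01-02\n"


def two_pass_read(get_handle):
    """Read the header in one pass, then everything in a second pass.

    `get_handle` is a zero-argument callable returning a text handle
    (hypothetical interface for this sketch).
    """
    # Pass 1: inspect only the header line
    columns = get_handle().readline().rstrip("\n").split("\t")
    # Pass 2: read what is expected to be the whole file
    lines = get_handle().read().splitlines()
    return columns, lines


# A fresh handle per pass behaves like opening a path twice
fresh_columns, fresh_lines = two_pass_read(lambda: io.StringIO(TEXT))

# Reusing one handle behaves like being given a single open file object:
# the second pass starts where the first one stopped
shared_handle = io.StringIO(TEXT)
shared_columns, shared_lines = two_pass_read(lambda: shared_handle)

print(len(fresh_lines), len(shared_lines))
```

A seekable handle could be rewound between passes, but streams such as standard input cannot, which is part of why passing shared keyword options to each read_csv() call is the smaller change.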

@victorlin victorlin merged commit e2ca468 into master Feb 12, 2024
20 checks passed
@victorlin victorlin deleted the victorlin/io-encoding branch February 12, 2024 20:44
@tsibley
Member

tsibley commented Feb 12, 2024

Ahhh, makes sense! Thanks for following up.

2 participants