Enforce UTF-8 encoding #1381

victorlin · 2023-12-27T19:38:28Z

Description of proposed changes

Enforce UTF-8 encoding when reading and writing files. Improve error messages when a non-UTF-8 file is used.

See commit messages for details.
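
As a rough illustration of the second goal (friendlier errors for non-UTF-8 input), here is a minimal sketch. It is not the code in this PR: `FileEncodingError` and `read_text_utf8()` are hypothetical names, and the message wording is made up for the example.

```python
DEFAULT_ENCODING = "utf-8"  # assumption: one shared default for all file IO


class FileEncodingError(Exception):
    """Hypothetical error type for input that is not valid UTF-8."""


def read_text_utf8(path):
    """Read a whole text file as UTF-8, translating decode failures."""
    try:
        with open(path, encoding=DEFAULT_ENCODING) as handle:
            return handle.read()
    except UnicodeDecodeError as error:
        # Re-raise with a message aimed at the user rather than a traceback
        # about codecs; the original exception is kept as the cause.
        raise FileEncodingError(
            f"{path!r} is not valid {DEFAULT_ENCODING} ({error.reason}). "
            "Try re-saving the file with UTF-8 encoding."
        ) from error
```

A caller such as a command-line entry point could catch `FileEncodingError` and print only the message, which keeps the decoding details out of the user's way.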

Related issue(s)

Checklist

Checks pass
If making user-facing changes, add a message in CHANGES.md summarizing the changes in this PR

codecov · 2023-12-27T19:46:31Z

Codecov Report

Attention: 10 lines in your changes are missing coverage. Please review.

Comparison is base (dd8a1cb) 67.55% compared to head (5762baa) 67.67%.

Files	Patch %	Lines
augur/sequence_traits.py	33.33%	2 Missing ⚠️
augur/export_v2.py	50.00%	1 Missing ⚠️
augur/import_/beast.py	50.00%	1 Missing ⚠️
augur/lbi.py	50.00%	1 Missing ⚠️
augur/measurements/export.py	50.00%	1 Missing ⚠️
augur/reconstruct_sequences.py	50.00%	1 Missing ⚠️
augur/traits.py	66.66%	1 Missing ⚠️
augur/tree.py	75.00%	1 Missing ⚠️
augur/utils.py	83.33%	1 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #1381      +/-   ##
==========================================
+ Coverage   67.55%   67.67%   +0.12%     
==========================================
  Files          69       69              
  Lines        7493     7518      +25     
  Branches     1844     1844              
==========================================
+ Hits         5062     5088      +26     
+ Misses       2159     2158       -1     
  Partials      272      272

victorlin · 2024-01-24T00:15:53Z

@tsibley this needs a bit more work but I'd like to get your opinion on the direction.

tsibley

This seems like a small but reasonable start!

Being explicit about encodings is best. Once Python's UTF-8 mode is the default (not for a while) we could be less explicit, but it'd really still be better to declare "Augur is UTF-8" explicitly.

It might be helpful to enable the default encoding warning in conjunction with the test suite to help us find places that rely on the implicit default of the current locale's encoding.

Are there bits in particular you wanted feedback on?

augur/io/metadata.py

victorlin · 2024-01-24T18:48:07Z

@tsibley no, that's helpful! Just wanted to see if you also thought declaring "Augur is UTF-8" is a good idea. I'll update this PR with your suggestions.

victorlin · 2024-01-24T23:20:45Z

It might be helpful to enable the default encoding warning in conjunction with the test suite to help us find places that rely on the implicit default of the current locale's encoding.

I added it to CI in 1b52deb and triggered a manual run to expose warnings. The output is very noisy because much external code doesn't set an encoding explicitly, but some Augur code is flagged too. Example:

site-packages/Bio/File.py:72: EncodingWarning: 'encoding' argument not specified
augur/filter/_run.py:140: EncodingWarning: 'encoding' argument not specified

I think the most I'll do here is use the CI run as reference and specify encoding explicitly for all warnings that arise from within the Augur codebase.
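
For anyone who wants to reproduce this locally, the sketch below shows one way to surface the warning outside CI. It assumes Python 3.10 or later, where the opt-in `-X warn_default_encoding` flag (or `PYTHONWARNDEFAULTENCODING=1`) exists. The helper name `default_encoding_warnings()` and the child snippet are made up for the example and are not part of the CI change.

```python
import subprocess
import sys

# Child program: line 2 relies on the locale's default encoding,
# line 3 states the encoding explicitly.
CHILD_CODE = (
    "import os\n"
    "open(os.devnull).close()\n"
    "open(os.devnull, encoding='utf-8').close()\n"
)


def default_encoding_warnings(code):
    """Run `code` in a child interpreter with the opt-in warning enabled.

    Returns the stderr lines that mention EncodingWarning.
    """
    result = subprocess.run(
        [sys.executable, "-X", "warn_default_encoding", "-W", "default", "-c", code],
        capture_output=True,
        encoding="utf-8",
        check=True,
    )
    return [line for line in result.stderr.splitlines() if "EncodingWarning" in line]


if __name__ == "__main__":
    for line in default_encoding_warnings(CHILD_CODE):
        print(line)
```

Only the call without an explicit `encoding` should be reported, which is the same signal used above to pick out the Augur lines from the noise.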

tsibley

From a13f3ef's message:

Maybe a similar central function should be used for pandas.read_csv().

For common/shared/required arguments to pandas.read_csv(), yeah. But if just for encoding/compression, we can pass read_csv() a filehandle returned by open_file() instead of a filename, right?

augur/io/file.py

tests/functional/filter/cram/filter-file-encoding-error.t

augur/filter/_run.py

CHANGES.md

victorlin · 2024-02-09T22:55:11Z

For common/shared/required arguments to pandas.read_csv(), yeah. But if just for encoding/compression, we can pass read_csv() a filehandle returned by open_file() instead of a filename, right?

Good point. Will look into this.

EDIT: Implemented as f0d1177

tsibley · 2024-02-10T01:04:24Z

Good point. Will look into this.

EDIT: Implemented as f0d1177

Ah, what I meant was turning:

pandas.read_csv(args.filename, ...)

into:

pandas.read_csv(open_file(args.filename), ...)

or the conceptual equivalent using a with statement, etc. That's a little bit easier if open_file() stops being a context manager itself and behaves more like open().

By opening the file for Pandas, we standardize the compression and encoding support for Augur.
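
A minimal sketch of that idea, under stated assumptions: the `open_file()` below is a hypothetical stand-in that behaves like open() (it is not Augur's implementation and only handles gzip), and `csv.DictReader` is used in place of pandas.read_csv() so the example needs only the standard library. pandas.read_csv() accepts an open file handle as its first argument in the same way.

```python
import csv
import gzip

DEFAULT_ENCODING = "utf-8"  # assumption: a single shared default


def open_file(path, mode="r", encoding=DEFAULT_ENCODING):
    """Hypothetical open()-like helper returning a plain text handle.

    Compression is chosen from the file extension; only gzip is handled here.
    """
    if mode not in ("r", "w"):
        raise ValueError("this sketch only supports text modes 'r' and 'w'")
    opener = gzip.open if str(path).endswith(".gz") else open
    return opener(path, mode + "t", encoding=encoding)


def read_rows(path):
    """Parse a TSV through the shared helper, so the parser never sees a path."""
    with open_file(path) as handle:
        return list(csv.DictReader(handle, delimiter="\t"))
```

Because the parser only ever receives a handle, encoding and compression decisions live in one place.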

augur/io/metadata.py

This is the default used by xopen as of v1.3.0.¹ Make it explicit here. Also, since this module is used for IO across the codebase, store the default encoding of UTF-8 in a variable to be used in other parts of the codebase in future commits.

¹ <https://github.com/pycompression/xopen/blob/v1.3.0/README.rst#changes>

This is a common user error, so handle it explicitly with a helpful error message.

This allows setting defaults and supporting various compression formats all from one central function. Maybe a similar central function should be used for pandas.read_csv().

Start with just an explicit encoding, applied to all invocations of read_csv(). The current motivation of this change is to enforce UTF-8 encoding. Another way to address that motivation would be to use Augur's internal open_file() in place of the first argument to read_csv() to handle encoding options as well as supported compression formats, but that's not a trivial change.

victorlin · 2024-02-12T20:38:48Z

Ah, what I meant was...

Oh whoops, I registered that at first but forgot to write down the change in direction after I tried it out.

Replacing the first argument to read_csv() across the board isn't a simple change, especially with read_metadata(), which opens the file multiple times and potentially returns a DataFrame chunk iterator.

Instead of trying to make that work, I thought it would be easiest to enforce UTF-8 through shared options.

I've summarized this in the commit message of a4601a2 (previously f0d1177).
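
To make the trade-off concrete, here is a small sketch of why a single handle is awkward for a reader that makes more than one pass. The two-pass functions are hypothetical stand-ins (a header pass, then a data pass), not read_metadata() itself, and `SHARED_OPTIONS` is an assumed name for the shared keyword arguments.

```python
import csv

DEFAULT_ENCODING = "utf-8"
SHARED_OPTIONS = {"encoding": DEFAULT_ENCODING}  # hypothetical shared options


def header_from_handle(handle):
    """First pass: read only the header line."""
    return next(csv.reader(handle, delimiter="\t"))


def rows_from_handle(handle):
    """Second pass: read rows, treating the next unread line as the header."""
    return list(csv.DictReader(handle, delimiter="\t"))


def two_pass_with_handle(handle):
    """Both passes share one handle, so the second pass starts mid-file."""
    columns = header_from_handle(handle)
    rows = rows_from_handle(handle)  # header already consumed: first data row is misread
    return columns, rows


def two_pass_with_path(path):
    """Each pass reopens the path with the same shared options."""
    with open(path, **SHARED_OPTIONS) as handle:
        columns = header_from_handle(handle)
    with open(path, **SHARED_OPTIONS) as handle:
        rows = rows_from_handle(handle)
    return columns, rows
```

With a path plus shared options, every pass starts from the beginning of the file and still gets UTF-8. With a shared handle, the caller would have to rewind or reopen between passes, which is harder still when a lazy chunk iterator is returned.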

tsibley · 2024-02-12T23:30:41Z

Ahhh, makes sense! Thanks for following up.

victorlin self-assigned this Dec 27, 2023

victorlin requested a review from tsibley January 24, 2024 00:15

tsibley reviewed Jan 24, 2024

victorlin force-pushed the victorlin/io-encoding branch from e244009 to f263054 Compare January 24, 2024 22:54

victorlin force-pushed the victorlin/io-encoding branch 2 times, most recently from be00ef3 to 9325ca9 Compare January 25, 2024 18:29

victorlin changed the title from "Improve file encoding errors" to "Enforce UTF-8 encoding" Jan 25, 2024

victorlin force-pushed the victorlin/io-encoding branch from 9325ca9 to cd479fd Compare January 25, 2024 18:38

victorlin marked this pull request as ready for review January 25, 2024 18:39

victorlin requested a review from a team January 25, 2024 18:39

tsibley self-requested a review February 7, 2024 19:39

tsibley requested changes Feb 7, 2024

victorlin force-pushed the victorlin/io-encoding branch from cd479fd to c2ff05e Compare February 9, 2024 22:55

victorlin force-pushed the victorlin/io-encoding branch from c2ff05e to b695954 Compare February 9, 2024 23:24

victorlin requested a review from tsibley February 9, 2024 23:28

tsibley approved these changes Feb 10, 2024

tsibley reviewed Feb 10, 2024

victorlin force-pushed the victorlin/io-encoding branch from b695954 to 657d524 Compare February 12, 2024 19:55

victorlin added 5 commits February 12, 2024 11:56

Handle unicode decoding errors

43e06a8

This is a common user error, so handle it explicitly with a helpful error message.

Use open_file() in place of open() across codebase

51a77a9

This allows setting defaults and supporting various compression formats all from one central function. Maybe a similar central function should be used for pandas.read_csv().

Update changelog

5762baa

victorlin force-pushed the victorlin/io-encoding branch from 657d524 to 5762baa Compare February 12, 2024 19:57

victorlin merged commit e2ca468 into master Feb 12, 2024
20 checks passed

victorlin deleted the victorlin/io-encoding branch February 12, 2024 20:44
