A recent discussion on the python-ideas mailing list made it clear that we (i.e. the core Python developers) need to provide some clearer guidance on how to handle text processing tasks that trigger exceptions by default in Python 3, but were previously swept under the rug by Python 2's blithe assumption that all files are encoded in "latin-1".
While we'll have something in the official docs before too long, this is my own preliminary attempt at summarising the options for processing text files, and the various trade-offs between them.
The obvious question to ask is what changed in Python 3 so that the common approaches developers previously used for text processing in Python 2 have now started to throw `UnicodeDecodeError` and `UnicodeEncodeError`.
The key difference is that the default text processing behaviour in Python 3 aims to detect text encoding problems as early as possible - either when reading improperly encoded text (indicated by `UnicodeDecodeError`) or when being asked to write out a text sequence that cannot be correctly represented in the target encoding (indicated by `UnicodeEncodeError`).
This contrasts with the Python 2 approach, which allowed data corruption by default and required strict correctness checks to be requested explicitly. That could certainly be convenient when the data being processed was predominantly ASCII text, and the occasional bit of data corruption was unlikely to be even detected, let alone cause problems, but it's hardly a solid foundation for building robust multilingual applications (as anyone that has ever had to track down an errant `UnicodeError` in Python 2 will know).
However, Python 3 does provide a number of mechanisms for relaxing the default strict checks in order to handle various text processing use cases (in particular, use cases where "best effort" processing is acceptable, and strict correctness is not required). This article aims to explain some of them by looking at cases where it would be appropriate to use them.
Note that many of the features I discuss below are available in Python 2 as well, but you have to explicitly access them via the `unicode` type and the `codecs` module. In Python 3, they're part of the behaviour of the `str` type and the `open` builtin.
To process text effectively in Python 3, it's necessary to learn at least a tiny amount about Unicode and text encodings:
- Python 3 always stores text strings as sequences of Unicode code points. These are values in the range 0-0x10FFFF. They don't always correspond directly to the characters you read on your screen, but that distinction doesn't matter for most text manipulation tasks.
- To store text as binary data, you must specify an encoding for that text.
- The process of converting from a sequence of bytes (i.e. binary data) to a sequence of code points (i.e. text data) is decoding, while the reverse process is encoding (a short sketch follows this list).
- For historical reasons, the most widely used encoding is `ascii`, which can only handle Unicode code points in the range 0-0x7F (i.e. ASCII is a 7-bit encoding).
- There are a wide variety of ASCII compatible encodings, which ensure that any appearance of a valid ASCII value in the binary data refers to the corresponding ASCII character.
- "utf-8" is becoming the preferred encoding for many applications, as it is an ASCII-compatible encoding that can encode any valid Unicode code point.
- "latin-1" is another significant ASCII-compatible encoding, as it maps byte values directly to the first 256 Unicode code points. (Note that Windows has it's own "latin-1" variant called cp1252, but, unlike the ISO "latin-1" implemented by the Python codec with that name, the Windows specific variant doesn't map all 256 possible byte values)
- There are also many ASCII incompatible encodings in widespread use, particularly in Asian countries (which had to devise their own solutions before the rise of Unicode) and on platforms such as Windows, Java and the .NET CLR, where many APIs accept text as UTF-16 encoded data.
- The `locale.getpreferredencoding()` call reports the encoding that Python will use by default for most operations that require an encoding (e.g. reading in a text file without a specified encoding). This is designed to aid interoperability between Python and the host operating system, but can cause problems with interoperability between systems (if encoding issues are not managed consistently).
- The `sys.getfilesystemencoding()` call reports the encoding that Python will use by default for most operations that both require an encoding and involve textual metadata in the filesystem (e.g. determining the results of `os.listdir()`).
- If you're a native English speaker residing in an English speaking country (like me!) it's tempting to think "but Python 2 works fine, why are you bothering me with all this Unicode malarkey?". It's worth trying to remember that we're actually a minority on this planet and, for most people on Earth, ASCII and `latin-1` can't even handle their name, let alone any other text they might want to write or process in their native language.
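To make the encoding/decoding distinction concrete, here's a minimal sketch (the sample string is arbitrary):

```python
# Encoding: text (str) -> binary data (bytes)
data = "café".encode("utf-8")
print(data)            # b'caf\xc3\xa9' - the é takes two bytes in UTF-8

# Decoding: binary data (bytes) -> text (str)
text = data.decode("utf-8")
print(text)            # café

# Decoding the same bytes with a different ASCII compatible encoding
# succeeds, but silently produces different text (mojibake)
print(data.decode("latin-1"))   # cafÃ©
```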
To help standardise various techniques for dealing with Unicode encoding and decoding errors, Python includes a concept of Unicode error handlers that are automatically invoked whenever a problem is encountered in the process of encoding or decoding text.
I'm not going to cover all of them in this article, but three are of particular significance:
- `strict`: this is the default error handler that just raises `UnicodeDecodeError` for decoding problems and `UnicodeEncodeError` for encoding problems.
- `surrogateescape`: this is the error handler that Python uses for most OS facing APIs to gracefully cope with encoding problems in the data supplied by the OS. It handles decoding errors by squirreling the data away in a little used part of the Unicode code point space (for those interested in more detail, see PEP 383). When encoding, it translates those hidden away values back into the exact original byte sequence that failed to decode correctly. Just as this is useful for OS APIs, it can make it easier to gracefully handle encoding problems in other contexts.
- `backslashreplace`: this is an encoding error handler that converts code points that can't be represented in the target encoding to the equivalent Python string numeric escape sequence. It makes it easy to ensure that `UnicodeEncodeError` will never be thrown, but doesn't lose much information while doing so (since we don't want encoding problems hiding error output, this error handler is enabled on `sys.stderr` by default).
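The differences between these three handlers are easiest to see in a small sketch (the byte string here is simply latin-1 encoded text being misread as UTF-8):

```python
data = b"caf\xe9"   # "café" encoded as latin-1; not valid UTF-8

# strict (the default) raises UnicodeDecodeError at the offending byte
try:
    data.decode("utf-8")
except UnicodeDecodeError as exc:
    print("strict:", exc)

# surrogateescape squirrels the bad byte away as a lone surrogate...
text = data.decode("utf-8", errors="surrogateescape")
print("surrogateescape:", ascii(text))   # 'caf\udce9'

# ...and restores the exact original byte sequence when encoding
assert text.encode("utf-8", errors="surrogateescape") == data

# backslashreplace converts unencodable code points to escape sequences
print("backslashreplace:", text.encode("ascii", errors="backslashreplace"))
```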
One alternative that is always available is to open files in binary mode and process them as bytes rather than as text. This can work in many cases, especially those where the ASCII markers are embedded in genuinely arbitrary binary data.
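Example: f = open(fname, "rb")

The resulting stream returns bytes rather than str from its read methods, so no implicit decoding step is involved and UnicodeDecodeError can never be raised.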
However, for both "text data with unknown encoding" and "text data with known encoding, but potentially containing encoding errors", it is often preferable to get them into a form that can be handled as text strings. In particular, some APIs that accept both bytes and text may be very strict about the encoding of the bytes they accept (for example, the `urllib.parse` module accepts only pure ASCII data for processing as bytes, but will happily process text strings containing non-ASCII code points).
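That asymmetry is easy to demonstrate; a minimal sketch using `urllib.parse.urlsplit`:

```python
from urllib.parse import urlsplit

# Text strings containing non-ASCII code points are processed happily
print(urlsplit("http://example.com/café"))

# Bytes input is implicitly decoded as strict ASCII, so non-ASCII
# byte values are rejected with a UnicodeDecodeError
try:
    urlsplit("http://example.com/café".encode("utf-8"))
except UnicodeDecodeError as exc:
    print("bytes rejected:", exc)
```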
This section explores a number of use cases that can arise when processing text. Text encoding is a sufficiently complex topic that there's no one size fits all answer - the right answer for a given application will depend on factors like:
- how much control you have over the text encodings used
- whether avoiding program failure is more important than avoiding data corruption or vice-versa
- how common encoding errors are expected to be, and whether they need to be handled gracefully or can simply be rejected as invalid input
Use case: the files to be processed are in an ASCII compatible encoding, but you don't know exactly which one. All files must be processed without triggering any exceptions, but some risk of data corruption is deemed acceptable (e.g. collating log files from multiple sources where some data errors are acceptable, so long as the logs remain largely intact).
Approach: use the "latin-1" encoding to map byte values directly to the first 256 Unicode code points. This is the closest equivalent Python 3 offers to the permissive Python 2 text handling model.
Example: f = open(fname, encoding="latin-1")
Note

While the Windows `cp1252` encoding is also sometimes referred to as "latin-1", it doesn't map all possible byte values, and thus needs to be used in combination with the `surrogateescape` error handler to ensure it never throws `UnicodeDecodeError`. The `latin-1` encoding in Python implements ISO_8859-1:1987, which maps all possible byte values to the first 256 Unicode code points, and thus ensures decoding errors will never occur regardless of the configured error handler.
Consequences:
- data will not be corrupted if it is simply read in, processed as ASCII text, and written back out again.
- will never raise `UnicodeDecodeError` when reading data
- will still raise `UnicodeEncodeError` if code points above 0xFF (e.g. smart quotes copied from a word processing program) are added to the text string before it is encoded back to bytes. To prevent such errors, use the `backslashreplace` error handler (or one of the other error handlers that replaces Unicode code points without a representation in the target encoding with sequences of ASCII code points).
- data corruption may occur if the source data is in an ASCII incompatible encoding (e.g. UTF-16)
- corruption may occur if data is written back out using an encoding other than `latin-1` (the sketch after this list demonstrates both the safe round trip and this risk)
- corruption may occur if the non-ASCII elements of the string are modified directly (e.g. for a variable width encoding like UTF-8 that has been decoded as `latin-1` instead, slicing the string at an arbitrary point may split a multi-byte character into two pieces)
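A sketch of the round-trip behaviour described above (the sample data is arbitrary UTF-8 text being treated as latin-1):

```python
raw = "café".encode("utf-8")    # b'caf\xc3\xa9'
text = raw.decode("latin-1")     # 'cafÃ©' - mojibake, but lossless

# Writing the data back out as latin-1 reproduces the original bytes
assert text.encode("latin-1") == raw

# Writing it back out in any other encoding corrupts the data
assert text.encode("utf-8") != raw
```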
Use case: the files to be processed are in an ASCII compatible encoding, but you don't know exactly which one. All files must be read without triggering any exceptions, but some Unicode related errors when writing are acceptable in order to reduce the risk of data corruption (e.g. collating log files from multiple sources, but wanting more explicit notification when the collated data is at risk of corruption due to programming errors that violate the assumption of writing the data back out only in its original encoding).
Approach: use the `ascii` encoding with the `surrogateescape` error handler.
Example: f = open(fname, encoding="ascii", errors="surrogateescape")
Consequences:
- data will not be corrupted if it is simply read in, processed as ASCII text, and written back out again.
- will never raise `UnicodeDecodeError` when reading data
- will still raise `UnicodeEncodeError` if code points above 0x7F (e.g. smart quotes copied from a word processing program) are added to the text string before it is encoded back to bytes. To prevent such errors, use the `backslashreplace` error handler (or one of the other error handlers that replaces Unicode code points without a representation in the target encoding with sequences of ASCII code points).
- will also raise `UnicodeEncodeError` if an attempt is made to encode a text string containing escaped byte values without enabling the `surrogateescape` error handler (or an even more tolerant handler like `backslashreplace`), as the sketch after this list shows.
- some Unicode processing libraries that ensure a code point sequence is valid text may complain about the escaping mechanism used (I'm not going to explain what it means here, but the phrase "lone surrogate" is a hint that something along those lines may be happening - the fact that "surrogate" also appears in the name of the error handler is not a coincidence).
- data corruption may still occur if the source data is in an ASCII incompatible encoding (e.g. UTF-16)
- data corruption is also still possible if the escaped portions of the string are modified directly
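Here's the equivalent sketch for the `ascii` codec with the `surrogateescape` handler, showing both the lossless round trip and the more explicit failure mode:

```python
raw = "café".encode("utf-8")                          # b'caf\xc3\xa9'
text = raw.decode("ascii", errors="surrogateescape")
print(ascii(text))                                     # 'caf\udcc3\udca9'

# The escaped bytes survive the round trip back to binary data...
assert text.encode("ascii", errors="surrogateescape") == raw

# ...but encoding without the handler raises UnicodeEncodeError,
# making the potential corruption visible instead of silent
try:
    text.encode("ascii")
except UnicodeEncodeError as exc:
    print(exc)
```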
Use case: the files to be processed are in a consistent encoding, the encoding can be determined from the OS details and locale settings and it is acceptable to refuse to process files that are not properly encoded.
Approach: simply open the file in text mode. This use case describes the default behaviour in Python 3.
Example: f = open(fname)
Consequences:
- `UnicodeDecodeError` may be thrown when reading such files (if the data is not actually in the encoding returned by `locale.getpreferredencoding()`; the sketch after this list shows how to check that value)
- `UnicodeEncodeError` may be thrown when writing such files (if attempting to write out code points which have no representation in the target encoding).
- the `surrogateescape` error handler can be used to be more tolerant of encoding errors if it is necessary to make a best effort attempt to process files that contain such errors instead of rejecting them outright as invalid input.
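The locale-dependent default assumed in this use case can be checked directly:

```python
import locale

# open() uses this platform/locale specific value when no encoding is given
print(locale.getpreferredencoding())   # e.g. 'UTF-8' on most Linux systems
```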
Use case: the files to be processed are nominally in a consistent encoding, you know the exact encoding in advance and it is acceptable to refuse to process files that are not properly encoded. This is becoming more and more common, especially with many text file formats beginning to standardise on UTF-8 as the preferred text encoding.
Approach: open the file in text mode with the appropriate encoding
Example: f = open(fname, encoding="utf-8")
Consequences:
- `UnicodeDecodeError` may be thrown when reading such files (if the data is not actually in the specified encoding)
- `UnicodeEncodeError` may be thrown when writing such files (if attempting to write out code points which have no representation in the target encoding).
- the `surrogateescape` error handler can be used to be more tolerant of encoding errors if it is necessary to make a best effort attempt to process files that contain such errors instead of rejecting them outright as invalid input (see the example after this list).
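The tolerant variant mentioned in the last point simply combines the explicit encoding with the error handler.

Example: f = open(fname, encoding="utf-8", errors="surrogateescape")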
Use case: the files to be processed include markers that specify the nominal encoding (with a default encoding assumed if no marker is present) and it is acceptable to refuse to process files that are not properly encoded.
Approach: first open the file in binary mode to look for the encoding marker, then reopen in text mode with the identified encoding.
Example: f = tokenize.open(fname) uses PEP 263 encoding markers to detect the encoding of Python source files (defaulting to UTF-8 if no encoding marker is detected).
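For other file formats the same two-step approach can be written out by hand; here's a rough sketch of what `tokenize.open()` does for Python source files, using the real `tokenize.detect_encoding()` helper:

```python
import tokenize

# Step 1: probe the binary stream for a PEP 263 encoding marker
with open(fname, "rb") as binary_file:
    encoding, _ = tokenize.detect_encoding(binary_file.readline)

# Step 2: reopen the file in text mode with the detected encoding
f = open(fname, encoding=encoding)
```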
Consequences:
- can handle files in different encodings
- may still raise `UnicodeDecodeError` if the encoding marker is incorrect
- must ensure marker is set correctly when writing such files
- even if it is not the default encoding, individual files can still be set to use UTF-8 as the encoding in order to support encoding almost all Unicode code points
- the `surrogateescape` error handler can be used to be more tolerant of encoding errors if it is necessary to make a best effort attempt to process files that contain such errors instead of rejecting them outright as invalid input.