# Unicode Errors

It is common to try to load a text file and get a `UnicodeDecodeError`.

The [Python documentation][] notes that:

> The Unicode standard describes how characters are represented by code points. A code point is an integer value, usually denoted in base 16. In the standard, a code point is written using the notation U+12CA to mean the character with value 0x12ca (4,810 decimal).

[Python documentation]: https://docs.python.org/3/howto/unicode.html
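
As a small illustration of the notation in the quotation, the built-ins `chr` and `ord` convert between a code point and its character. The sample values below are only the one the documentation happens to use, plus one character that turns up later on this page.

```python
# U+12CA is the example code point from the Python Unicode HOWTO.
code_point = 0x12CA
character = chr(code_point)            # the character at U+12CA

print(code_point)                      # the same value in decimal
print(f"U+{ord(character):04X}")       # back to the U+XXXX notation

# A more familiar character, written both ways.
n_tilde = "\u00d1"                     # N with tilde
print(ord(n_tilde), hex(ord(n_tilde)))
```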

Some notes:
* The entire I/O system was rewritten for Python 3.
* In Python 3, every string (`str`) is a sequence of Unicode code points, and text-mode I/O decodes bytes into such strings.
* Unicode assigns a standard integer code point to every character in every language it covers. There are 1,114,112 possible code points (U+0000 through U+10FFFF); UTF-8 is one way of encoding them as bytes, using one to four bytes per character.
* ASCII is 7-bit and has 128 characters (0-127); Latin-1 is 8-bit and has 256 characters (0-255). Latin-1 is also known as `ISO-8859-1`.
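
To make these size differences concrete, here is a minimal sketch that encodes a few sample characters (chosen only for illustration) with each codec and checks the top of the code point range via `sys.maxunicode`.

```python
import sys

# The highest code point Python supports is U+10FFFF.
print(hex(sys.maxunicode))

samples = ["A", "\u00d1", "\u2014"]    # A, N with tilde, em dash
for ch in samples:
    for codec in ("ascii", "latin-1", "utf-8"):
        try:
            encoded = ch.encode(codec)
            print(f"U+{ord(ch):04X} {codec:8} {encoded!r}")
        except UnicodeEncodeError:
            # The character is outside this codec's repertoire.
            print(f"U+{ord(ch):04X} {codec:8} cannot encode")
```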

My first step is to move one of the problematic files, a plain text version of Upton Sinclair's _The Jungle_, into the `texts/` directory:

In [1]:
%ls texts

an.txt      hod.txt*    jungle.txt  mdg.txt*


Once it's in the directory, let's try opening it without yet reading it into a string:

In [2]:
the_file = open('texts/jungle.txt', 'r', encoding='utf-8')

While I'm pretty sure this is not going to work, let's confirm that things are wonky:

In [3]:
text = the_file.read()

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd1 in position 278: invalid continuation byte
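
The error message can be reproduced without the file. The sketch below assumes the offending bytes look like the fragment `shoulders\xd1it`, which is inferred from the decoded output shown further down this page rather than read from the file itself. In UTF-8, the byte 0xD1 announces a two-byte sequence, so the decoder expects the next byte to fall in the range 0x80-0xBF and gives up when it finds the letter `i` instead.

```python
# Hypothetical fragment: the byte 0xD1 sits between two ASCII words.
raw = b"shoulders\xd1it"

reason = None
position = None
bad_byte = None
try:
    raw.decode("utf-8")
except UnicodeDecodeError as err:
    reason = err.reason                # why the decoder gave up
    position = err.start               # index of the offending byte
    bad_byte = err.object[err.start]   # the byte itself, as an int

print(reason)
print(hex(bad_byte), position)
```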

Specifying `encoding='utf-8'` when opening the file above makes the choice explicit. When no encoding is given, Python typically uses the platform's preferred encoding (what `locale.getpreferredencoding(False)` returns), which is UTF-8 on most macOS and Linux systems but often not on Windows. One of the suggestions in the documentation was to use the `surrogateescape` error handler, but I found that this not only took a long time but produced no results.
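
For reference, this is roughly how the default encoding and two of the error handlers behave, using a hypothetical byte fragment rather than the real file. `surrogateescape` passes each undecodable byte through as a lone surrogate code point so the original bytes can be recovered when encoding again, while `replace` substitutes U+FFFD and loses the byte.

```python
import locale

# The encoding open() typically falls back to when none is given.
print(locale.getpreferredencoding(False))

raw = b"shoulders\xd1it"               # hypothetical sample bytes

replaced = raw.decode("utf-8", errors="replace")
escaped = raw.decode("utf-8", errors="surrogateescape")

print(ascii(replaced))                 # the bad byte became U+FFFD
print(ascii(escaped))                  # the bad byte became U+DCD1

# surrogateescape round-trips back to the original bytes.
restored = escaped.encode("utf-8", errors="surrogateescape")
print(restored == raw)
```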

What did work was trying other encodings. UTF-8 and Latin-1 are both supersets of ASCII, and Latin-1 is the simpler of the two (one byte per character), so I thought I would back up from UTF-8 to Latin-1 and then, if need be, to ASCII until, with luck, I got results. I got results on the first try:

In [4]:
with open('texts/jungle.txt', 'r', encoding='latin-1') as the_file:
    jungle = the_file.read()

print(jungle[0:300])

THE JUNGLE By Upton Sinclair (1906) Chapter 1 It was four o'clock when the ceremony was over and the carriages began to arrive. There had been a crowd following all the way, owing to the exuberance of Marija Berczynskas. The occasion rested heavily upon Marija's broad shouldersÑit was her task to se


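Success, more or less: the file loads, though the `Ñ` in `shouldersÑit` sits where one would expect a dash, so Latin-1 may not be the encoding the file was actually written in.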
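
Latin-1 assigns a character to every possible byte, so a decode that succeeds does not prove the guess was right. Here is a minimal sketch, again on a hypothetical byte fragment, of how the same byte reads under a few candidate codecs. The list of candidates is my assumption, not something the file declares.

```python
raw = b"shoulders\xd1it"               # hypothetical sample bytes

candidates = ["utf-8", "latin-1", "cp1252", "mac_roman"]
decoded = {}
for codec in candidates:
    try:
        decoded[codec] = raw.decode(codec)
    except UnicodeDecodeError:
        decoded[codec] = None          # this codec rejects the bytes

for codec, text in decoded.items():
    print(f"{codec:10} {ascii(text)}")
```

In Mac OS Roman, 0xD1 is an em dash (U+2014), which fits the sentence better than `Ñ`. If the file came from an older Mac, `encoding='mac_roman'` may be worth trying.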

The code below is left over from exploring Python's use of Unicode:

In [None]:
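import sys

# `jungle` is the string read from texts/jungle.txt above.
print(sys.maxunicode)         # largest code point Python supports (1114111)
print(sys.stdout.encoding)    # encoding used when printing to standard output
print(sys.getsizeof(jungle))  # memory footprint of the str object, in bytes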