# Unicode in Python
## A Continuous Battle

**Conventions:**

I'm going to say **"bytes"** or **"byte string"** to mean the Python 2 "str" type and the Python 3 "bytes" type.

I'm going to say **"unicode string"** to mean the Python 2 "unicode" type and the Python 3 "str" type.

I'm going to prefix my string literals with **b** or **u** to try to make it obvious which of the two I mean.

In [28]:
b"some text"  # this is a byte string
u"some text"  # this is a unicode string

u'some text'

In [29]:
from __future__ import print_function
byte_string = b"© 2017 OKCPython"  # Python 2: the literal holds the UTF-8 bytes of the source, so © is \xc2\xa9
print(type(byte_string))
print(repr(byte_string))
print(byte_string)

<type 'str'>
'\xc2\xa9 2017 OKCPython'
© 2017 OKCPython


In [30]:
unicode(byte_string)

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 0: ordinal not in range(128)

# What's the correct way to convert the byte string to a unicode string?

In [32]:
byte_string.encode("latin-1")

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 0: ordinal not in range(128)

In [48]:
byte_string.encode("utf8")

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 0: ordinal not in range(128)

In [34]:
byte_string.decode("ascii")

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 0: ordinal not in range(128)

In [39]:
byte_string.decode("latin-1")

u'\xc2\xa9 2017 OKCPython'

In [40]:
print(byte_string.decode("latin-1"))

Â© 2017 OKCPython


In [41]:
byte_string.decode("utf8")

u'\xa9 2017 OKCPython'

In [51]:
unicode_string = byte_string.decode("utf8")
print(unicode_string)

© 2017 OKCPython
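
Why does `latin-1` "work" but print the wrong thing? Here's a minimal sketch, using escaped bytes so it doesn't depend on the source file's encoding. `latin-1` maps every possible byte to exactly one character, so decoding with it never fails; it just hands back one character per byte. `utf8` knows that `\xc2\xa9` together make a single ©.

```python
utf8_bytes = b"\xc2\xa9 2017 OKCPython"  # the UTF-8 bytes for "© 2017 OKCPython"

as_latin1 = utf8_bytes.decode("latin-1")  # one character per byte
as_utf8 = utf8_bytes.decode("utf8")       # \xc2\xa9 becomes a single character

print(len(utf8_bytes))  # 17 bytes
print(len(as_latin1))   # 17 characters, starting with u"\xc2\xa9" ("Â©")
print(len(as_utf8))     # 16 characters, starting with u"\xa9" ("©")
```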


# What does it take to get back to a byte string?

In [52]:
print(str(unicode_string))

UnicodeEncodeError: 'ascii' codec can't encode character u'\xa9' in position 0: ordinal not in range(128)

In [53]:
print(unicode_string.encode("ascii"))

UnicodeEncodeError: 'ascii' codec can't encode character u'\xa9' in position 0: ordinal not in range(128)

In [54]:
print(unicode_string.encode("utf8"))

© 2017 OKCPython


In [55]:
print(unicode_string.decode("utf8"))

UnicodeEncodeError: 'ascii' codec can't encode character u'\xa9' in position 0: ordinal not in range(128)

In [56]:
print(unicode_string.decode("ascii"))

UnicodeEncodeError: 'ascii' codec can't encode character u'\xa9' in position 0: ordinal not in range(128)

# How can you make sense of all of these options?

# How I wrap my brain around it

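In Python 2, `text.encode(...)` always returns a **byte string**, whether `text` is a byte string or a unicode string.

In Python 2, `text.decode(...)` always returns a **unicode string**, whether `text` is a byte string or a unicode string.

(Python 3 drops the confusing half of each: `bytes` has no `.encode` and `str` has no `.decode`.)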
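
That "whether" hides an implicit step, and it explains the confusing errors above. As far as I understand it, when Python 2 is asked to encode a byte string it first decodes it with the default codec (`ascii`), and when asked to decode a unicode string it first encodes it with `ascii`. Here's a sketch that spells those steps out so it also runs on Python 3. `py2_style_encode` and `py2_style_decode` are names I made up for illustration, not anything in the standard library.

```python
def py2_style_encode(byte_string, encoding):
    # Hypothetical helper: the implicit ascii decode, then the requested encode.
    return byte_string.decode("ascii").encode(encoding)


def py2_style_decode(unicode_string, encoding):
    # Hypothetical helper: the implicit ascii encode, then the requested decode.
    return unicode_string.encode("ascii").decode(encoding)


byte_string = b"\xc2\xa9 2017 OKCPython"
unicode_string = u"\xa9 2017 OKCPython"

encode_error = None
try:
    py2_style_encode(byte_string, "utf8")
except UnicodeDecodeError as error:
    encode_error = error  # an *encode* call failing with a *decode* error

decode_error = None
try:
    py2_style_decode(unicode_string, "utf8")
except UnicodeEncodeError as error:
    decode_error = error  # a *decode* call failing with an *encode* error

# Pure-ASCII input passes through both implicit steps untouched,
# which is why the problem can stay hidden for so long.
ascii_ok = py2_style_encode(b"2017 OKCPython", "utf8")
```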

So what I try to remember is that you **encode** text as **bytes** using a particular **encoding**.

From that I derive the reverse: you **decode** bytes into **text**, and to do that you have to know the **encoding** that was used to serialize the text.
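
A small round trip makes that concrete. This is a sketch with encodings I picked for illustration: the same text becomes different bytes under different encodings, and decoding with the matching encoding gets the text back.

```python
text = u"\xa9 2017 OKCPython"  # "© 2017 OKCPython"

encoded = {}
for encoding in ("utf8", "latin-1", "utf-16-le"):
    encoded[encoding] = text.encode(encoding)
    # Decoding with the same encoding restores the original text.
    assert encoded[encoding].decode(encoding) == text

print(len(encoded["utf8"]))       # 17: the copyright sign takes two bytes
print(len(encoded["latin-1"]))    # 16: one byte per character
print(len(encoded["utf-16-le"]))  # 32: two bytes per character
```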


## To convert a byte string to a unicode string, use:

In [57]:
byte_string.decode("utf8")

u'\xa9 2017 OKCPython'

## To convert a unicode string to a byte string, use:

In [61]:
unicode_string.encode("utf8")

'\xc2\xa9 2017 OKCPython'
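
In code that might receive either type, I find it handy to wrap those two calls in helpers that only convert when needed. `to_unicode` and `to_bytes` are hypothetical names, and defaulting to `utf8` is my assumption, not a rule. The sketch relies on `bytes` being the byte string type on both Python 2 (where it is an alias for `str`) and Python 3.

```python
def to_unicode(value, encoding="utf8"):
    """Return value as a unicode string, decoding only if it is bytes."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


def to_bytes(value, encoding="utf8"):
    """Return value as a byte string, encoding only if it is unicode."""
    if isinstance(value, bytes):
        return value
    return value.encode(encoding)


from_bytes = to_unicode(b"\xc2\xa9 2017 OKCPython")   # decoded
from_unicode = to_unicode(u"\xa9 2017 OKCPython")     # returned unchanged
back_to_bytes = to_bytes(from_bytes)                  # encoded again
```

Neither helper can guess an encoding for you. If the bytes are not UTF-8, the caller still has to pass the right `encoding`.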

# Reading Files

In [63]:
import io
with io.open("copyright.txt") as input_file:
    for line in input_file:
        print(type(line))
        print(line)

<type 'unicode'>
✓ Read the File

<type 'unicode'>
✓ Write the File

<type 'unicode'>
Still to do: weird double s thing: §

<type 'unicode'>


<type 'unicode'>
© 2017 OKCPython



In [79]:
import io
with io.open("copyright.txt", "rb") as input_file:
    for line in input_file:
        print(type(line))
        print(line)

<type 'str'>
Read the File

<type 'str'>
Write the File

<type 'str'>
Still to do: weird double s thing: §

<type 'str'>


<type 'str'>
© 2017 OKCPython



In [80]:
import io
with io.open("copyright_windows_1250.txt", "rb") as input_file:
    for line in input_file:
        print(type(line))
        print(line)

<type 'str'>
Read the File

<type 'str'>
Write the File

<type 'str'>
Still to do: weird double s thing: �

<type 'str'>


<type 'str'>
� 2017 OKCPython



In [81]:
import io
with io.open("copyright_windows_1250.txt", "rt") as input_file:
    for line in input_file:
        print(type(line))
        print(line)

UnicodeDecodeError: 'utf8' codec can't decode byte 0xa7 in position 64: invalid start byte
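
That error is UTF-8 refusing a byte that can't start a character. If you can't find out the real encoding, the codecs accept an `errors` argument that trades the exception for a visible placeholder. Here's a minimal sketch on a byte string (`io.open` takes the same `errors` argument). The sample bytes are mine, not the contents of the file above.

```python
windows_1250_bytes = b"thing: \xa7"  # \xa7 is the section sign in windows-1250

try:
    windows_1250_bytes.decode("utf8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True  # the same kind of failure as the file read above

# errors="replace" substitutes U+FFFD (the replacement character) for bad bytes.
replaced = windows_1250_bytes.decode("utf8", errors="replace")

# errors="ignore" silently drops them, which loses data without a trace.
ignored = windows_1250_bytes.decode("utf8", errors="ignore")
```

Both options hide the real problem, so I'd treat them as a last resort for when the correct encoding can't be determined.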

In [82]:
# Read the file with the correct encoding, resulting in unicode strings
import io
with io.open("copyright_windows_1250.txt", "rt", encoding="windows-1250") as input_file:
    for line in input_file:
        print(type(line))
        print(line)

<type 'unicode'>
Read the File

<type 'unicode'>
Write the File

<type 'unicode'>
Still to do: weird double s thing: §

<type 'unicode'>


<type 'unicode'>
© 2017 OKCPython
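
To try this without the original files, here's a self-contained sketch that writes a few windows-1250 bytes to a temporary file and reads them back as unicode. The file name and contents are made up for the example.

```python
import io
import os
import shutil
import tempfile

# Two lines of text, stored on disk as windows-1250 bytes.
raw = u"Still to do: \xa7\n\xa9 2017 OKCPython\n".encode("windows-1250")

directory = tempfile.mkdtemp()
try:
    path = os.path.join(directory, "sample_windows_1250.txt")
    with io.open(path, "wb") as output_file:
        output_file.write(raw)

    # Naming the encoding lets io.open hand back unicode strings.
    with io.open(path, "rt", encoding="windows-1250") as input_file:
        lines = [line.rstrip(u"\n") for line in input_file]
finally:
    shutil.rmtree(directory)  # clean up the temporary directory
```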



In [85]:
# Read the file as latin-1 (which matches windows-1250 for § and ©), then encode each line as UTF-8 byte strings
import io
with io.open("copyright_windows_1250.txt", "rt", encoding="latin-1") as input_file:
    for line in input_file:
        line = line.encode("utf8")
        print(type(line))
        print(line)

<type 'str'>
Read the File

<type 'str'>
Write the File

<type 'str'>
Still to do: weird double s thing: §

<type 'str'>


<type 'str'>
© 2017 OKCPython
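
Decoding a windows-1250 file as `latin-1` only gives the right answer here because the two encodings happen to agree on the bytes for § and ©. They don't agree everywhere, as a quick check shows. The `\xa3` byte is my own example, not something from the file.

```python
# The two bytes that appear in the file decode identically either way.
for byte in (b"\xa7", b"\xa9"):
    assert byte.decode("latin-1") == byte.decode("windows-1250")

# But the same byte can mean different characters in the two encodings.
pound_sign = b"\xa3".decode("latin-1")        # u"\xa3", the pound sign
l_with_stroke = b"\xa3".decode("windows-1250")  # u"\u0141", capital L with stroke

encodings_agree = pound_sign == l_with_stroke  # False
```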



# Writing Files

In [18]:
unicode_data = u"© 2017 OKCPython"
bytes_data_utf8 = b"\xc2\xa9 2017 OKCPython"
bytes_data_latin1 = b"\xa9 2017 OKCPython"

In [19]:
# Write to a file in text mode; with no encoding argument, io.open uses the locale's default (UTF-8 on this machine)
import io
with io.open("copyright_output_unicode_to_utf8.txt", "wt") as output_file:
    output_file.write(unicode_data)

In [25]:
# Opening in 'text' mode expects unicode, not bytes!
with io.open("copyright_output_utf8_bytes_to_utf8.txt", "wt") as output_file:
    output_file.write(bytes_data_utf8)

TypeError: must be unicode, not str

In [26]:
# Write utf8-encoded byte string to file encoded as UTF-8
import io
with io.open("copyright_output_utf8_bytes_to_utf8.txt", "wt") as output_file:
    output_file.write(bytes_data_utf8.decode("utf8"))

In [28]:
# Write latin1-encoded byte string to file encoded as UTF-8
import io
with io.open("copyright_output_latin1_bytes_to_utf8.txt", "wt") as output_file:
    output_file.write(bytes_data_latin1.decode("latin-1"))
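
Those cells leave the output encoding to `io.open`'s default, which comes from the locale. Here's a sketch of being explicit instead, then reading the raw bytes back to check what landed on disk. The temporary file names are made up.

```python
import io
import os
import shutil
import tempfile

unicode_data = u"\xa9 2017 OKCPython"  # "© 2017 OKCPython"

directory = tempfile.mkdtemp()
written = {}
try:
    for encoding in ("utf8", "latin-1"):
        path = os.path.join(directory, "copyright_" + encoding + ".txt")
        # Text mode with an explicit encoding: we write unicode, io.open encodes.
        with io.open(path, "wt", encoding=encoding) as output_file:
            output_file.write(unicode_data)
        # Binary mode shows exactly which bytes were stored.
        with io.open(path, "rb") as input_file:
            written[encoding] = input_file.read()
finally:
    shutil.rmtree(directory)  # clean up the temporary directory
```

The same unicode string ends up as `\xc2\xa9 ...` in the `utf8` file and `\xa9 ...` in the `latin-1` file.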

### Or, we can open the file and write to it in binary (byte) mode

In [15]:
# To write a unicode string to a byte stream, we have to encode it with a particular encoding
import io
with io.open("copyright_output_unicode_to_binary.txt", "wb") as output_file:
    output_file.write(unicode_data.encode("utf8"))

In [29]:
# To write a utf8-encoded byte string to a byte stream, we just have to write it out
import io
with io.open("copyright_output_utf8_bytes_to_binary.txt", "wb") as output_file:
    output_file.write(bytes_data_utf8)

Here is where I find it easy to get myself into trouble. We wrote a file in binary mode, using a byte string. All too often, not enough thought goes into which encoding those bytes were in when the file was written.

Was it plain `ascii`? Was it `latin-1` (aka `iso-8859-1`, which the HTML5 spec treats as `windows-1252`)?

Was it `utf8`? `utf16`? `utf32`?

We have been somewhat spoiled in the US by being able to frequently ignore the encoding of text files, since there is so much overlap between `ascii`, `latin-1`, and `utf8`.
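
That overlap is easy to see in a sketch: pure ASCII text produces identical bytes in all three encodings, so a wrong guess goes unnoticed until the first non-ASCII character shows up.

```python
encodings = ("ascii", "latin-1", "utf8")

ascii_text = u"2017 OKCPython"
ascii_variants = set(ascii_text.encode(encoding) for encoding in encodings)
print(len(ascii_variants))  # 1: all three encodings give the same bytes

copyright_sign = u"\xa9"
as_latin1 = copyright_sign.encode("latin-1")  # one byte: \xa9
as_utf8 = copyright_sign.encode("utf8")       # two bytes: \xc2\xa9

try:
    copyright_sign.encode("ascii")
    ascii_can_encode = True
except UnicodeEncodeError:
    ascii_can_encode = False  # ascii has no byte for the copyright sign
```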


In [30]:
# To write a latin1-encoded byte string to a byte stream,
# I convert it to unicode and then encode it to the desired output encoding
import io
with io.open("copyright_output_latin1_bytes_to_binary.txt", "wb") as output_file:
    output_file.write(bytes_data_latin1.decode("latin-1").encode("utf8"))
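
The same decode-then-encode step can be wrapped in a small function. `transcode` is a hypothetical name of mine. The sketch also shows the trap mentioned above: if the bytes were already UTF-8 and I wrongly treat them as latin-1, nothing fails, and the output is silently double-encoded.

```python
def transcode(data, source_encoding, target_encoding="utf8"):
    """Hypothetical helper: re-encode a byte string from one encoding to another."""
    return data.decode(source_encoding).encode(target_encoding)


bytes_data_latin1 = b"\xa9 2017 OKCPython"
bytes_data_utf8 = b"\xc2\xa9 2017 OKCPython"

# Right guess about the source encoding: we get proper UTF-8.
correct = transcode(bytes_data_latin1, "latin-1")

# Wrong guess: each of the two UTF-8 bytes is treated as its own latin-1
# character and encoded again, so the copyright sign becomes four bytes.
double_encoded = transcode(bytes_data_utf8, "latin-1")
```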