# Unicode in Python
## A Continuous Battle

**Conventions:**

I'm going to say **"bytes"** or **"byte string"** to mean the Python 2 "str" type and the Python 3 "bytes" type.

I'm going to say **"unicode string"** to mean the Python 2 "unicode" type and the Python 3 "str" type.

*I would really like to call them **text strings** instead of **unicode strings** to clarify their purpose, but I think that would confuse things at the moment.*

I'm going to prefix my string literals with **b** or **u** to try to make it obvious which of the two I mean.

In [160]:
b"some bytes"  # this is a byte string
u"some text"   # this is a unicode string

u'some text'

In [161]:
from __future__ import print_function
byte_string = b"© 2017 OKCPython"  # utf-8 encoded byte string
print(type(byte_string))
print(repr(byte_string))
print(byte_string)

<type 'str'>
'\xc2\xa9 2017 OKCPython'
© 2017 OKCPython


In [119]:
unicode(byte_string)

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 0: ordinal not in range(128)

# What's the correct way to convert the byte string to a unicode string?

In [120]:
byte_string.encode(u"latin-1")

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 0: ordinal not in range(128)

In [121]:
byte_string.encode(u"utf-8")

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 0: ordinal not in range(128)

In [122]:
byte_string.decode(u"ascii")

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 0: ordinal not in range(128)

In [123]:
byte_string.decode(u"latin-1")

u'\xc2\xa9 2017 OKCPython'

In [124]:
print(byte_string.decode(u"latin-1"))

Â© 2017 OKCPython


In [125]:
byte_string.decode(u"utf-8")

u'\xa9 2017 OKCPython'

In [126]:
unicode_string = byte_string.decode(u"utf-8")
print(unicode_string)

© 2017 OKCPython


# What does it take to get back to a byte string?

In [127]:
print(str(unicode_string))

UnicodeEncodeError: 'ascii' codec can't encode character u'\xa9' in position 0: ordinal not in range(128)

In [128]:
print(unicode_string.encode(u"ascii"))

UnicodeEncodeError: 'ascii' codec can't encode character u'\xa9' in position 0: ordinal not in range(128)

In [129]:
print(unicode_string.encode(u"utf-8"))

© 2017 OKCPython


In [130]:
print(unicode_string.decode(u"utf-8"))

UnicodeEncodeError: 'ascii' codec can't encode character u'\xa9' in position 0: ordinal not in range(128)

In [131]:
print(unicode_string.decode(u"ascii"))

UnicodeEncodeError: 'ascii' codec can't encode character u'\xa9' in position 0: ordinal not in range(128)

# How can you make sense of all of these options?

# How I wrap my brain around it

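In Python 2, `text.encode(...)` always returns a **byte string**, whether `text` is a byte string or a unicode string.

In Python 2, `text.decode(...)` always returns a **unicode string**, whether `text` is a byte string or a unicode string.

That is also why the "wrong direction" calls above fail with the opposite-sounding error: calling `.encode()` on a byte string first decodes it implicitly with the `ascii` codec, and calling `.decode()` on a unicode string first encodes it implicitly with `ascii`. (In Python 3, `bytes` has no `.encode()` and `str` has no `.decode()`, so these mistakes are not possible there.)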

So what I try to remember is that you **encode** text as **bytes** using a particular **encoding**.

And then I derive from that the idea that you **decode** bytes into **text** by knowing the **encoding** used to serialize the text.


## To convert a byte string to a unicode string, use:

In [132]:
byte_string.decode(u"utf-8")

u'\xa9 2017 OKCPython'

## To convert a unicode string to a byte string, use:

In [133]:
unicode_string.encode(u"utf-8")

'\xc2\xa9 2017 OKCPython'
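
The two conversions above are often wrapped in small helpers so that code can accept either type. The sketch below is only an illustration: `to_text` and `to_bytes` are hypothetical names, not part of the standard library, and defaulting to UTF-8 is an assumption that will not hold for every data source.

```python
# Hypothetical helpers (not from the standard library); UTF-8 default is an assumption.
# Written to run on both Python 2.7 and Python 3, where `bytes` is the byte string type.

def to_text(value, encoding="utf-8"):
    """Return value as a unicode string, decoding it only if it is bytes."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


def to_bytes(value, encoding="utf-8"):
    """Return value as a byte string, encoding it only if it is text."""
    if isinstance(value, bytes):
        return value
    return value.encode(encoding)


copyright_bytes = b"\xc2\xa9 2017 OKCPython"   # utf-8 encoded byte string
copyright_text = to_text(copyright_bytes)      # unicode string
print(repr(copyright_text))
print(repr(to_bytes(copyright_text)))
```

Because each helper checks the type first, calling it on a value that is already in the target form returns it unchanged instead of triggering an implicit `ascii` conversion.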

# Is a unicode string the same thing as utf-8?

No!

UTF-8 is one way to encode text as bytes. A UTF-8 encoded byte string is the serialized form of a unicode string.

Also, UTF-8 isn't the only way to encode Unicode strings.

In [164]:
unicode_data = u"© 2017 OKCPython ✓"
bytes_data_utf8 = unicode_data.encode(u"utf-8")
bytes_data_utf16 = unicode_data.encode(u"utf-16")

print("len unicode:", len(unicode_data))
print("len utf-8:", len(bytes_data_utf8))
print("len utf-16:", len(bytes_data_utf16))
print("18 * 2 = ", 18 * 2)
print(repr(bytes_data_utf16))
print(repr(u"".encode(u"utf-16")))
assert bytes_data_utf8 == bytes_data_utf16  # fails: the two encodings produce different bytes


len unicode: 18
len utf-8: 21
len utf-16: 38
18 * 2 =  36
"\xff\xfe\xa9\x00 \x002\x000\x001\x007\x00 \x00O\x00K\x00C\x00P\x00y\x00t\x00h\x00o\x00n\x00 \x00\x13'"
'\xff\xfe'


AssertionError: 
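
The output above shows 38 bytes rather than 18 * 2 = 36, and an empty string still encodes to two bytes. Those two extra bytes are the byte order mark (BOM) that the `utf-16` codec prepends. A minimal sketch of that arithmetic, using `utf-16-le` (which writes no BOM) for comparison; the variable names are only for illustration:

```python
import codecs

text = u"\xa9 2017 OKCPython \u2713"        # same 18 characters as the cell above

utf16_with_bom = text.encode("utf-16")      # BOM + 2 bytes per character
utf16_le = text.encode("utf-16-le")         # explicit byte order, no BOM

bom = utf16_with_bom[:2]

print("len utf-16:", len(utf16_with_bom))
print("len utf-16-le:", len(utf16_le))
print("BOM:", repr(bom))

# The BOM is one of the two constants from the codecs module,
# depending on the byte order of the machine running the code.
assert bom in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)
```

Every character in this string fits in a single 16-bit unit, which is why the BOM-free length is exactly twice the character count.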

In [165]:
unicode_data = u"2017 OKCPython"
bytes_data_utf8 = unicode_data.encode(u"utf-8")
bytes_data_latin1 = unicode_data.encode(u"latin-1")
bytes_data_windows1250 = unicode_data.encode(u"windows-1250")

print("len unicode:", len(unicode_data))
print("len utf-8:", len(bytes_data_utf8))
print("len latin-1:", len(bytes_data_latin1))
print("len windows-1250:", len(bytes_data_windows1250))
print(repr(unicode_data))
print(repr(bytes_data_utf8))
print(repr(bytes_data_latin1))
print(repr(bytes_data_windows1250))

assert unicode_data == bytes_data_utf8 == bytes_data_latin1 == bytes_data_windows1250


len unicode: 14
len utf-8: 14
len latin-1: 14
len windows-1250: 14
u'2017 OKCPython'
'2017 OKCPython'
'2017 OKCPython'
'2017 OKCPython'


In [193]:
unicode_data = u"© 2017 OKCPython Ô"
bytes_data_utf8 = unicode_data.encode(u"utf-8")
bytes_data_latin1 = unicode_data.encode(u"latin-1")
bytes_data_windows1250 = unicode_data.encode(u"windows-1250")

print("len unicode:", len(unicode_data))
print("len utf-8:", len(bytes_data_utf8))
print("len latin-1:", len(bytes_data_latin1))
print("len windows-1250:", len(bytes_data_windows1250))
print(repr(unicode_data))
print(repr(bytes_data_utf8))
print(repr(bytes_data_latin1))
print(repr(bytes_data_windows1250))

assert bytes_data_latin1 == bytes_data_windows1250


len unicode: 18
len utf-8: 20
len latin-1: 18
len windows-1250: 18
u'\xa9 2017 OKCPython \xd4'
'\xc2\xa9 2017 OKCPython \xc3\x94'
'\xa9 2017 OKCPython \xd4'
'\xa9 2017 OKCPython \xd4'


In [198]:
unicode_data = u"Bõrgen Görgon"
bytes_data_utf8 = unicode_data.encode(u"utf-8")
bytes_data_latin1 = unicode_data.encode(u"latin-1")

print("len unicode:", len(unicode_data))
print("len utf-8:", len(bytes_data_utf8))
print("len latin-1:", len(bytes_data_latin1))
print(repr(unicode_data))
print(repr(bytes_data_utf8))
print(repr(bytes_data_latin1))

# Cannot be encoded using windows-1250 encoding
bytes_data_windows1250 = unicode_data.encode(u"windows-1250")

len unicode: 13
len utf-8: 15
len latin-1: 13
u'B\xf5rgen G\xf6rgon'
'B\xc3\xb5rgen G\xc3\xb6rgon'
'B\xf5rgen G\xf6rgon'


UnicodeEncodeError: 'charmap' codec can't encode character u'\xf5' in position 1: character maps to <undefined>

In [202]:
unicode_data = u"Bőrgen Görgon"
bytes_data_utf8 = unicode_data.encode(u"utf-8")
bytes_data_windows1250 = unicode_data.encode(u"windows-1250")

print("len unicode:", len(unicode_data))
print("len utf-8:", len(bytes_data_utf8))
print("len windows-1250:", len(bytes_data_windows1250))
print(repr(unicode_data))
print(repr(bytes_data_utf8))
print(repr(bytes_data_windows1250))

# Cannot be encoded using latin-1 encoding
bytes_data_latin1 = unicode_data.encode(u"latin-1")

len unicode: 13
len utf-8: 15
len windows-1250: 13
u'B\u0151rgen G\xf6rgon'
'B\xc5\x91rgen G\xc3\xb6rgon'
'B\xf5rgen G\xf6rgon'


UnicodeEncodeError: 'latin-1' codec can't encode character u'\u0151' in position 1: ordinal not in range(256)
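
The last two cells show that latin-1 and windows-1250 assign different characters to the same byte value. A minimal sketch of what that means when decoding: the same byte string yields two different names depending on which encoding you assume it was written in.

```python
raw = b"B\xf5rgen G\xf6rgon"                 # identical bytes in both cells above

as_latin1 = raw.decode("latin-1")            # 0xf5 -> U+00F5 (o with tilde)
as_windows1250 = raw.decode("windows-1250")  # 0xf5 -> U+0151 (o with double acute)

print(repr(as_latin1))
print(repr(as_windows1250))

# Neither decode raises an error, so a wrong guess goes unnoticed.
assert as_latin1 != as_windows1250
```

Neither call fails, so nothing in the bytes themselves tells you which interpretation is the right one; that information has to come from wherever the data came from.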

# How do I deal with comparing strings?

In [216]:
scraped_page_data = u"Bõrgen Görgon was here"
name = b'B\xf5rgen G\xf6rgon'

scraped_page_data != name

# Both of these will fail with a similar error
name in scraped_page_data
u"B" in name



UnicodeDecodeError: 'ascii' codec can't decode byte 0xf5 in position 1: ordinal not in range(128)

* Make sure you know the types of the strings you are dealing with
* I strongly recommend adding `from __future__ import unicode_literals` at the top of your Python 2 files.
* Convert all strings to unicode strings before performing the comparison
* For the example above:

In [220]:
scraped_page_data = u"Bõrgen Görgon was here"
name = b'B\xf5rgen G\xf6rgon'

# [Group effort here]

scraped_page_data != name
name in scraped_page_data
u"B" in name



UnicodeDecodeError: 'ascii' codec can't decode byte 0xf5 in position 1: ordinal not in range(128)
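
One possible resolution of the exercise above, following the bullet points: decode `name` to a unicode string before comparing. This sketch assumes `name` is latin-1 encoded (the byte `0xf5` matches `õ` in latin-1); if the bytes came from a different encoding, the decode step would need to change accordingly.

```python
scraped_page_data = u"B\xf5rgen G\xf6rgon was here"   # unicode string
name = b"B\xf5rgen G\xf6rgon"                         # byte string

# Assumption: these bytes are latin-1 encoded.
name_text = name.decode("latin-1")

# All three operations now compare text with text.
print(scraped_page_data != name_text)
print(name_text in scraped_page_data)
print(u"B" in name_text)
```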

# When should I use the encoding:
* ascii
* latin-1
* utf-8
* utf-16
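
One way to explore the question above is to check which of these encodings can represent a given piece of text at all. The helper below is a hypothetical sketch (the name `encodable_with` is mine, not a library function); it only tests whether encoding succeeds and says nothing about which encoding a consumer of the data expects.

```python
def encodable_with(text, encodings=("ascii", "latin-1", "utf-8", "utf-16")):
    """Return the subset of encodings that can represent every character in text."""
    usable = []
    for encoding in encodings:
        try:
            text.encode(encoding)
        except UnicodeEncodeError:
            continue  # this encoding has no byte sequence for some character
        usable.append(encoding)
    return usable


print(encodable_with(u"2017 OKCPython"))         # plain ASCII text
print(encodable_with(u"\xa9 2017 OKCPython"))    # contains the copyright sign
print(encodable_with(u"\u2713 Read the File"))   # contains a check mark
```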

# Reading Files

In [134]:
import io
with io.open(u"copyright_input_utf8.txt") as input_file:
    for line in input_file:
        print(type(line))
        print(repr(line))
        print(line)

<type 'unicode'>
u'\u2713 Read the File\n'
✓ Read the File

<type 'unicode'>
u'\u2713 Write the File\n'
✓ Write the File

<type 'unicode'>
u'Still to do: weird double s thing: \xa7\n'
Still to do: weird double s thing: §

<type 'unicode'>
u'\n'


<type 'unicode'>
u'\xa9 2017 OKCPython\n'
© 2017 OKCPython



In [135]:
import io
with io.open(u"copyright_input_utf8.txt", u"rb") as input_file:
    for line in input_file:
        print(type(line))
        print(repr(line))
        print(line)

<type 'str'>
'\xe2\x9c\x93 Read the File\n'
✓ Read the File

<type 'str'>
'\xe2\x9c\x93 Write the File\n'
✓ Write the File

<type 'str'>
'Still to do: weird double s thing: \xc2\xa7\n'
Still to do: weird double s thing: §

<type 'str'>
'\n'


<type 'str'>
'\xc2\xa9 2017 OKCPython\n'
© 2017 OKCPython



In [136]:
import io
with io.open(u"copyright_input_windows_1250.txt", u"rb") as input_file:
    for line in input_file:
        print(type(line))
        print(repr(line))
        print(line)

<type 'str'>
'Read the File\n'
Read the File

<type 'str'>
'Write the File\n'
Write the File

<type 'str'>
'Still to do: weird double s thing: \xa7\n'
Still to do: weird double s thing: �

<type 'str'>
'\n'


<type 'str'>
'\xa9 2017 OKCPython\n'
� 2017 OKCPython



In [137]:
import io
with io.open(u"copyright_input_windows_1250.txt", u"rt") as input_file:
    for line in input_file:
        print(type(line))
        print(repr(line))
        print(line)

UnicodeDecodeError: 'utf8' codec can't decode byte 0xa7 in position 64: invalid start byte

In [138]:
# Read the file with the correct encoding, resulting in unicode strings
import io
filename = u"copyright_input_windows_1250.txt"
with io.open(filename, u"rt", encoding=u"windows-1250") as input_file:
    for line in input_file:
        print(type(line))
        print(repr(line))
        print(line)

<type 'unicode'>
u'Read the File\n'
Read the File

<type 'unicode'>
u'Write the File\n'
Write the File

<type 'unicode'>
u'Still to do: weird double s thing: \xa7\n'
Still to do: weird double s thing: §

<type 'unicode'>
u'\n'


<type 'unicode'>
u'\xa9 2017 OKCPython\n'
© 2017 OKCPython



In [139]:
# Read the file with the correct encoding, then re-encode
# each line, resulting in utf-8 encoded byte strings
import io
filename = u"copyright_input_windows_1250.txt"
with io.open(filename, u"rt", encoding=u"windows-1250") as input_file:
    for line in input_file:
        line = line.encode(u"utf-8")
        print(type(line))
        print(repr(line))
        print(line)

<type 'str'>
'Read the File\n'
Read the File

<type 'str'>
'Write the File\n'
Write the File

<type 'str'>
'Still to do: weird double s thing: \xc2\xa7\n'
Still to do: weird double s thing: §

<type 'str'>
'\n'


<type 'str'>
'\xc2\xa9 2017 OKCPython\n'
© 2017 OKCPython
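
The cells above depend on input files that are not included here. The sketch below reproduces the same behaviour in a self-contained way, under the assumption that a temporary file written as windows-1250 is a fair stand-in for `copyright_input_windows_1250.txt`: reading with the correct `encoding` returns the original text, while reading it as UTF-8 raises `UnicodeDecodeError`.

```python
import io
import os
import tempfile

text = u"Still to do: weird double s thing: \xa7\n\xa9 2017 OKCPython\n"

# Create a stand-in input file encoded as windows-1250.
fd, path = tempfile.mkstemp(suffix=".txt")
os.close(fd)
try:
    with io.open(path, "wb") as output_file:
        output_file.write(text.encode("windows-1250"))

    # Correct encoding: we get the original unicode string back.
    with io.open(path, "rt", encoding="windows-1250") as input_file:
        read_back = input_file.read()

    # Wrong encoding: the byte 0xa7 is not valid UTF-8 on its own.
    try:
        with io.open(path, "rt", encoding="utf-8") as input_file:
            input_file.read()
        wrong_encoding_failed = False
    except UnicodeDecodeError:
        wrong_encoding_failed = True
finally:
    os.remove(path)

print(repr(read_back))
print("utf-8 read failed:", wrong_encoding_failed)
```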



# Writing Files

In [140]:
unicode_data = u"© 2017 OKCPython"
bytes_data_utf8 = b"\xc2\xa9 2017 OKCPython"
bytes_data_latin1 = b"\xa9 2017 OKCPython"

In [141]:
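# Write to file encoded as UTF-8. Without an explicit encoding,
# io.open falls back to the platform's preferred encoding.
import io
filename = u"copyright_output_unicode_to_utf8.txt"
with io.open(filename, u"wt", encoding=u"utf-8") as output_file:
    output_file.write(unicode_data)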

In [142]:
# Opening in 'text' mode expects unicode, not bytes!
filename = u"copyright_output_utf8_bytes_to_utf8.txt"
with io.open(filename, u"wt") as output_file:
    output_file.write(bytes_data_utf8)

TypeError: write() argument 1 must be unicode, not str

In [143]:
# Write utf8-encoded byte string to file encoded as UTF-8
import io
filename = u"copyright_output_utf8_bytes_to_utf8.txt"
with io.open(filename, u"wt", encoding=u"utf-8") as output_file:
    output_file.write(bytes_data_utf8.decode(u"utf-8"))

In [144]:
# Write latin1-encoded byte string to file encoded as UTF-8
import io
filename = u"copyright_output_latin1_bytes_to_utf8.txt"
with io.open(filename, u"wt", encoding=u"utf-8") as output_file:
    output_file.write(bytes_data_latin1.decode(u"latin-1"))

### Or, we can open the file and write to it in binary (byte) mode

In [145]:
# To write a unicode string to a byte stream, we have
# to encode it with a particular encoding
import io
filename = u"copyright_output_unicode_to_binary.txt"
with io.open(filename, u"wb") as output_file:
    output_file.write(unicode_data.encode(u"utf-8"))

In [146]:
# To write a utf8-encoded byte string to a byte stream, we
# just have to write it out
import io
filename = u"copyright_output_utf8_bytes_to_binary.txt"
with io.open(filename, u"wb") as output_file:
    output_file.write(bytes_data_utf8)

This is where I find it easy to get myself into trouble. We wrote a file in binary mode, using a byte string. All too often, little thought goes into which encoding the written file actually uses.

Was it plain `ascii`? Was it `latin-1` (aka `iso-8859-1`, which the HTML5 spec treats as `windows-1252`)?

Was it `utf8`? `utf16`? `utf32`?

We have been somewhat spoiled in the US by being able to frequently ignore the encoding of text files, since there is so much overlap between `ascii`, `latin-1`, and `utf8`.


In [222]:
# To write a latin1-encoded byte string to a byte stream,
# I convert it to unicode and then encode it to the desired output encoding
import io
filename = u"copyright_output_latin1_bytes_to_binary.txt"
with io.open(filename, u"wb") as output_file:
    output_file.write(bytes_data_latin1.decode(u"latin-1").encode(u"utf-8"))
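
The decode-then-encode step in the last cell can be pulled out into a small function, which makes the source and target encodings explicit at the call site. This is a sketch only: `transcode` is a hypothetical helper name, and a temporary directory stands in for the real output location so the example cleans up after itself.

```python
import io
import os
import shutil
import tempfile


def transcode(data, source_encoding, target_encoding="utf-8"):
    """Decode bytes using source_encoding, then re-encode as target_encoding."""
    return data.decode(source_encoding).encode(target_encoding)


bytes_data_latin1 = b"\xa9 2017 OKCPython"

workdir = tempfile.mkdtemp()
try:
    path = os.path.join(workdir, "copyright_output_latin1_bytes_to_binary.txt")

    # Binary mode: we are responsible for the bytes being in the intended encoding.
    with io.open(path, "wb") as output_file:
        output_file.write(transcode(bytes_data_latin1, "latin-1"))

    # Read the raw bytes back to confirm what actually landed on disk.
    with io.open(path, "rb") as input_file:
        written = input_file.read()
finally:
    shutil.rmtree(workdir)

print(repr(written))
```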

# Questions?