## Handling Unicode errors

From the [scikit-learn documentation](http://scikit-learn.org/stable/modules/feature_extraction.html#decoding-text-files):

> Text is made of **characters**, but files are made of **bytes**. These bytes represent characters according to some **encoding**. To work with text files in Python, their bytes must be decoded to a character set called **Unicode**. Common encodings are ASCII, Latin-1 (Western Europe), KOI8-R (Russian) and the universal encodings UTF-8 and UTF-16. Many others exist.

**Why should you care?**

When working with text in Python, you are likely to encounter errors related to encoding, and understanding Unicode will help you to troubleshoot these errors.

**Unicode basics:**

- Unicode is a system that assigns a unique number to every character in every language. These numbers are called **code points**. For example, the [code point](http://www.unicode.org/charts/index.html) for "A" is U+0041, and the official name is "LATIN CAPITAL LETTER A".
- An **encoding** specifies how code points are stored as bytes, whether in memory or in a file:
    - **UTF-8** is the most popular Unicode encoding. It uses 8 to 32 bits to store each character.
    - **UTF-16** is the second most popular Unicode encoding. It uses 16 or 32 bits to store each character.
    - **UTF-32** is the least popular Unicode encoding. It uses 32 bits to store each character.
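
The points above can be checked in Python 3 with a short sketch. The helper name `byte_lengths` and the sample characters are illustrative choices, not part of the original notebook; the `-le` codec variants are used so that no byte order mark is added to the counts.

```python
import unicodedata

ch = "A"
code_point = ord(ch)                  # the code point as an integer
print(f"U+{code_point:04X}")          # U+0041
print(unicodedata.name(ch))           # LATIN CAPITAL LETTER A


def byte_lengths(text):
    """Return the number of bytes `text` occupies in each Unicode encoding.

    Hypothetical helper for illustration. The little-endian codecs are used
    so that no byte order mark inflates the UTF-16 and UTF-32 counts.
    """
    return {enc: len(text.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le")}


# One character from each UTF-8 length class (1, 2, 3, and 4 bytes)
for sample in ["A", "é", "€", "😀"]:
    print(sample, byte_lengths(sample))
```

UTF-8 grows from one to four bytes as code points get larger, UTF-16 uses two or four bytes, and UTF-32 always uses four.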

**ASCII basics:**
- ASCII is an encoding from the 1960s that uses 7 bits per character (usually stored in one 8-bit byte), and only supports **English characters**.
- ASCII-encoded files are sometimes called **plain text**.
- UTF-8 is **backward-compatible** with ASCII, because each of the 128 ASCII characters is encoded in UTF-8 as a single byte with the same value it has in ASCII.
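
A minimal Python 3 sketch of this compatibility, using sample strings chosen only for illustration: pure-ASCII text produces identical bytes under both encodings, while a non-ASCII character cannot be encoded as ASCII at all.

```python
text = "hello"
ascii_bytes = text.encode("ascii")
utf8_bytes = text.encode("utf-8")
print(ascii_bytes == utf8_bytes)      # True: same bytes for ASCII-only text

# A character outside ASCII still works in UTF-8, but fails in ASCII
accented = "café"
utf8_accented = accented.encode("utf-8")
print(utf8_accented)                  # b'caf\xc3\xa9'

try:
    accented.encode("ascii")
    ascii_failed = False
except UnicodeEncodeError as err:
    ascii_failed = True
    print("ASCII cannot represent this text:", err)
```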

The default encoding in **Python 2** is ASCII. The default encoding in **Python 3** is UTF-8.
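
The Python 3 default can be inspected with `sys.getdefaultencoding()`. The sketch below also round-trips a string through a temporary file (a hypothetical `sample.txt`) with an explicit `encoding` argument, since the default used for file I/O can depend on the platform and passing the encoding explicitly is the safer habit.

```python
import sys
import tempfile
from pathlib import Path

# Default used by str.encode() and bytes.decode() when no encoding is given
default_encoding = sys.getdefaultencoding()
print(default_encoding)

# Write and read a file with the encoding stated explicitly
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "sample.txt"   # illustrative file name
    path.write_text("café", encoding="utf-8")
    round_trip = path.read_text(encoding="utf-8")

print(round_trip)
```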

In [2]:
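# Python 2: examine two types of strings
# (Python 2 prints <type 'str'> and <type 'unicode'>; the output below is from
# Python 3, where the u prefix is still accepted but both literals are 'str')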
print(type('hello'))
print(type(u'hello'))

<class 'str'>
<class 'str'>


In [4]:
# Python 2: 'encode' converts 'unicode' to 'str'
# (in Python 3, 'encode' converts 'str' to 'bytes', as in the output below)
u'hello'.encode(encoding='utf-8')

b'hello'

In [6]:
# Python 3: examine two types of strings
print(type(b'hello'))
print(type('hello'))

<class 'bytes'>
<class 'str'>


In [None]:
# Python 3: 'decode' converts 'bytes' to 'str'
b'hello'.decode(encoding='utf-8')
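
Decoding is also where most Unicode errors show up: bytes written in one encoding are read back with another. The following Python 3 sketch assumes a small Latin-1 byte string as the mismatched input and shows the `errors` argument of `bytes.decode` as one way to handle the failure. Decoding with the correct encoding is preferable whenever it is known.

```python
# Bytes produced with Latin-1, where 'é' is the single byte 0xE9
raw = "café".encode("latin-1")
print(raw)                                          # b'caf\xe9'

# Decoding them as UTF-8 fails, because 0xE9 alone is not valid UTF-8
try:
    raw.decode("utf-8")
    decode_failed = False
except UnicodeDecodeError as err:
    decode_failed = True
    print("Decoding failed:", err)

# errors='replace' substitutes U+FFFD for each undecodable byte
replaced = raw.decode("utf-8", errors="replace")
# errors='ignore' silently drops undecodable bytes
ignored = raw.decode("utf-8", errors="ignore")
# Using the encoding the bytes were actually written in recovers the text
correct = raw.decode("latin-1")

print(repr(replaced), repr(ignored), repr(correct))
```

Both `replace` and `ignore` lose information, so they are best treated as a fallback for data whose encoding cannot be determined.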