# Python Unicode: how to avoid UnicodeEncodeError?

Based on https://nedbatchelder.com/text/unipain.html

In Python 2, there are two different string data types:
* A plain string literal gives you a `str` object, which stores __bytes__.
* A literal with a `u` prefix gives you a `unicode` object, which stores __code points__.

In [1]:
my_string = "Hello World"
type(my_string)

str

In [2]:
my_unicode = u"Hi \u2119\u01b4\u2602\u210c\xf8\u1f24"
type(my_unicode)

unicode

In [3]:
print my_unicode

Hi ℙƴ☂ℌøἤ


### `.encode()` and `.decode()`

Unicode strings have an `.encode()` method, which turns code points into bytes.

Byte strings have a `.decode()` method, which turns bytes into code points.

unicode `.encode()` → bytes

bytes `.decode()` → unicode

In [4]:
my_unicode = u"Hi \u2119\u01b4\u2602\u210c\xf8\u1f24"
len(my_unicode)

9

In [5]:
my_utf8 = my_unicode.encode('utf-8')
len(my_utf8)

19

In [6]:
my_utf8

'Hi \xe2\x84\x99\xc6\xb4\xe2\x98\x82\xe2\x84\x8c\xc3\xb8\xe1\xbc\xa4'

In [7]:
my_utf8.decode('utf-8')

u'Hi \u2119\u01b4\u2602\u210c\xf8\u1f24'

### Encoding errors

Many encodings can represent only a subset of Unicode. Encoding a code point outside that subset raises `UnicodeEncodeError`.

In [11]:
my_unicode.encode('ascii')

UnicodeEncodeError: 'ascii' codec can't encode characters in position 3-8: ordinal not in range(128)
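
To see which encodings cover a given string, you can try each one and catch the error. The sketch below is only an illustration: `can_encode` is a hypothetical helper, not part of the standard library, and the code is written so that it should run on both Python 2.7 and Python 3.

```python
my_unicode = u"Hi \u2119\u01b4\u2602\u210c\xf8\u1f24"

def can_encode(text, encoding):
    """Return True if every code point in text fits in the given encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

# ASCII and Latin-1 cover only part of Unicode; UTF-8 and UTF-16 cover all of it.
results = {name: can_encode(my_unicode, name)
           for name in ("ascii", "latin-1", "utf-8", "utf-16")}
```

Here `results` is `False` for `ascii` and `latin-1`, and `True` for `utf-8` and `utf-16`.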

### Error handling

In [12]:
my_unicode.encode("ascii", "replace")

'Hi ??????'

In [13]:
my_unicode.encode("ascii", "xmlcharrefreplace")

'Hi &#8473;&#436;&#9730;&#8460;&#248;&#7972;'

In [14]:
my_unicode.encode("ascii", "ignore")

'Hi '
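
The second argument to `.encode()` names an error handler. The default is `"strict"`, which raises. As a side-by-side sketch, the handlers shown above can be compared together with `"backslashreplace"`, which the cells above do not cover. The variable names are illustrative, and the code is meant to run on both Python 2.7 and Python 3.

```python
my_unicode = u"Hi \u2119\u01b4\u2602\u210c\xf8\u1f24"

handlers = ("replace", "xmlcharrefreplace", "ignore", "backslashreplace")
# Map each handler name to the ASCII bytes it produces.
encoded = {name: my_unicode.encode("ascii", name) for name in handlers}

# "strict" is the default handler: it raises instead of producing output.
try:
    my_unicode.encode("ascii", "strict")
    strict_raised = False
except UnicodeEncodeError:
    strict_raised = True
```

Note that `"replace"` and `"ignore"` throw information away, while `"xmlcharrefreplace"` and `"backslashreplace"` keep the code point values in an escaped form.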

### Implicit conversion

When you mix bytes and unicode, Python 2 implicitly decodes the bytes using the default encoding (`sys.getdefaultencoding()`, which is ASCII).

In [15]:
u"Hello " + "world"

u'Hello world'

In [16]:
u"Hello " + ("world".decode("ascii"))

u'Hello world'

In [18]:
import sys
sys.getdefaultencoding()

'ascii'
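
The implicit decode only works while the bytes are pure ASCII. The sketch below spells out the equivalent explicit decode, so it should behave the same on Python 2.7 and Python 3 (Python 3 refuses to mix the two types at all). `concat` is a hypothetical helper used only for illustration.

```python
ascii_bytes = b"world"
utf8_bytes = u"w\xf8rld".encode("utf-8")  # contains the non-ASCII bytes \xc3\xb8

def concat(text, raw, encoding="ascii"):
    """Explicit version of what Python 2 does when unicode meets bytes."""
    return text + raw.decode(encoding)

ok = concat(u"Hello ", ascii_bytes)  # pure ASCII: decodes fine

try:
    concat(u"Hello ", utf8_bytes)    # non-ASCII bytes: the ASCII decode fails
    failed = False
except UnicodeDecodeError:
    failed = True

# Naming the real encoding makes the same operation succeed.
fixed = concat(u"Hello ", utf8_bytes, "utf-8")
```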

### Python 2 is “helpful”

Converting implicitly: helpful?

Works great when everything is ASCII

When that fails: __PAIN__

### Pro tip #1: Unicode sandwich

Bytes on the outside, unicode on the inside

Encode/decode at the edges

(Decode bytes to unicode as soon as they enter your program, work with unicode everywhere inside it, and encode back to bytes as late as possible, right before output.)
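
A minimal sketch of the sandwich, assuming the input and output are both UTF-8. The function names `shout` and `process` are hypothetical, and the code is written to run on both Python 2.7 and Python 3.

```python
def shout(text):
    """Inside of the sandwich: works on unicode only."""
    return text.upper()

def process(raw_in, encoding="utf-8"):
    """Decode at the input edge, encode at the output edge."""
    text = raw_in.decode(encoding)   # bytes -> unicode, as early as possible
    result = shout(text)             # all real work happens on unicode
    return result.encode(encoding)   # unicode -> bytes, as late as possible

raw = u"caf\xe9".encode("utf-8")     # bytes arriving from outside
out = process(raw)                   # bytes going back out
```

Only `process` knows about encodings. Everything it calls can assume it receives unicode.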

### Pro tip #2: Know what you have

Bytes or Unicode?

If bytes, what encoding?

In [21]:
print type(my_unicode) 

<type 'unicode'>


In [22]:
print repr(my_unicode)

u'Hi \u2119\u01b4\u2602\u210c\xf8\u1f24'
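
The type check can be wrapped in a small helper, and the second question matters because the bytes alone do not say which encoding they use. The sketch below shows both points. `kind` is a hypothetical helper, and the code is written to run on both Python 2.7 and Python 3.

```python
def kind(value):
    """Classify a value as 'unicode', 'bytes', or 'other'."""
    if isinstance(value, type(u"")):  # unicode on Python 2, str on Python 3
        return "unicode"
    if isinstance(value, bytes):      # str on Python 2, bytes on Python 3
        return "bytes"
    return "other"

raw = b"\xc3\xb8"

# The same two bytes give different text depending on the encoding you assume.
as_utf8 = raw.decode("utf-8")      # one code point: U+00F8
as_latin1 = raw.decode("latin-1")  # two code points: U+00C3, U+00B8
```

Both decodes succeed, so a wrong guess does not raise an error here. It silently produces the wrong text.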


### Pro tip #3: Test Unicode