Pre-presentation steps
1. Change background color to "lightskyblue"
2. Restart Kernel
3. Clear all Cell Output
4. Run cell below

In [1]:
# Load some definitions
import binascii

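def bytestring(s):
    # Space-separated hex value of each byte in a byte string, e.g. "dc f1"
    return " ".join([binascii.hexlify(x) for x in s])

def utf8(u):
    # Hex bytes of a unicode string encoded as UTF-8
    return bytestring(u.encode('utf-8'))

def utf16(u):
    # Hex bytes of a unicode string encoded as little-endian UTF-16 (no BOM)
    return bytestring(u.encode('utf-16le'))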

What is a string?

In [2]:
type("A String")

str

In [3]:
b = b"A Byte String"
u = u"A Unicode String"
type(b), type(u)

(str, unicode)

Adding `str` and `unicode`

In [4]:
"bytes_" + u"unicode"

u'bytes_unicode'

Encoding and Decoding

In [5]:
latin1_string = "\xdc\xf1\xee\xe7\xf8d\xe9"
print bytestring(latin1_string)

dc f1 ee e7 f8 64 e9


In [6]:
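# Calling encode() on a byte str makes Python 2 decode it as ASCII first,
# which is why the failure below is a UnicodeDecodeError.
latin1_string.encode('ascii')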

UnicodeDecodeError: 'ascii' codec can't decode byte 0xdc in position 0: ordinal not in range(128)

In [7]:
latin1_string.encode('utf-8')

UnicodeDecodeError: 'ascii' codec can't decode byte 0xdc in position 0: ordinal not in range(128)

In [8]:
unicode_string = latin1_string.decode('latin1')
utf8_string = unicode_string.encode('utf8')

print unicode_string, type(unicode_string)
print utf8_string, type(utf8_string)

Üñîçødé <type 'unicode'>
Üñîçødé <type 'str'>


In [9]:
print type(unicode_string)
print type(utf8_string)

<type 'unicode'>
<type 'str'>


More Adding `str` and `unicode`

In [10]:
a = "bytes_" + latin1_string
b = "bytes_" + unicode_string
c = "bytes_" + utf8_string

print repr(a), a
print repr(b), b
print repr(c), c

'bytes_\xdc\xf1\xee\xe7\xf8d\xe9' bytes_�����d�
u'bytes_\xdc\xf1\xee\xe7\xf8d\xe9' bytes_Üñîçødé
'bytes_\xc3\x9c\xc3\xb1\xc3\xae\xc3\xa7\xc3\xb8d\xc3\xa9' bytes_Üñîçødé
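
The mixed concatenation in `b` only works because Python 2 silently decodes the byte string "bytes_" as ASCII before joining it to the unicode string. A minimal sketch of the explicit alternative, decoding at the boundary and concatenating text only, is shown here. The helper name `to_text` and its default encoding are illustrative assumptions, not part of the original notebook, and the code is written so that it should run on both Python 2 and Python 3.

```python
# Hypothetical helper: decode byte strings explicitly instead of relying on
# Python 2's implicit ASCII decoding.
def to_text(value, encoding="utf-8"):
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value  # already text, leave untouched

latin1_bytes = b"\xdc\xf1\xee\xe7\xf8d\xe9"
utf8_bytes = b"\xc3\x9c\xc3\xb1\xc3\xae\xc3\xa7\xc3\xb8d\xc3\xa9"

# Name the codec for each input, then work with text only.
a = to_text(b"bytes_") + to_text(latin1_bytes, "latin-1")
c = to_text(b"bytes_") + to_text(utf8_bytes, "utf-8")

# Both byte strings turn into the same text once the right codec is named.
print(a == c)
```

With every input decoded up front, the result no longer depends on which operand happened to be `str` and which `unicode`.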


System Text Settings

In [11]:
import sys, locale

expressions = """
        locale.getpreferredencoding()
        type(my_file)
        my_file.encoding
        sys.stdout.isatty()
        sys.stdout.encoding
        sys.stdin.isatty()
        sys.stdin.encoding
        sys.stderr.isatty()
        sys.stderr.encoding
        sys.getdefaultencoding()
        sys.getfilesystemencoding()
    """

def get_text_settings():
    # my_file is looked up by name when the expressions are evaluated below
    with open('/tmp/dummy', 'w') as my_file:
        for expression in expressions.split():
            value = eval(expression)
            print expression.rjust(30), '->', repr(value)

In [12]:
get_text_settings()

 locale.getpreferredencoding() -> 'UTF-8'
                 type(my_file) -> <type 'file'>
              my_file.encoding -> None
           sys.stdout.isatty() -> False
           sys.stdout.encoding -> 'UTF-8'
            sys.stdin.isatty() -> False
            sys.stdin.encoding -> None
           sys.stderr.isatty() -> False
           sys.stderr.encoding -> 'UTF-8'
      sys.getdefaultencoding() -> 'ascii'
   sys.getfilesystemencoding() -> 'utf-8'
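
The `my_file.encoding -> None` line reflects that a Python 2 file opened with the built-in `open` deals in bytes and has no encoding attached. One possible way to make the encoding explicit is `io.open`, which exists in Python 2.6+ and Python 3. The sketch below is an assumption about how that might look, not something the original notebook does, and the temporary file is purely illustrative.

```python
import io
import os
import tempfile

text = u"\xdc\xf1\xee\xe7\xf8d\xe9"  # same text as unicode_string above

# Create a scratch file path (illustrative; any writable path would do).
fd, path = tempfile.mkstemp(suffix=".txt")
os.close(fd)

try:
    # io.open returns a text-mode file that encodes on write...
    with io.open(path, "w", encoding="utf-8") as f:
        file_encoding = f.encoding
        f.write(text)

    # ...and decodes on read, so the program only ever handles unicode.
    with io.open(path, "r", encoding="utf-8") as f:
        round_tripped = f.read()

    # On disk the text is stored as UTF-8 bytes.
    with io.open(path, "rb") as f:
        raw = f.read()
finally:
    os.remove(path)

print(file_encoding)
print(round_tripped == text)
```

Naming the encoding at the file boundary means the result does not depend on `locale.getpreferredencoding()` or `sys.getdefaultencoding()`.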


Detecting Encodings

In [13]:
import chardet
string1 = '\xd6\xd0\xb9\xfa\xd6\xc6\xd4\xec\xb5\xc4\xc1\xec\xbe\xfc\xd5\xdf3\xc3\xfb'
string2 = '\xe4\xb8\xad\xe5\x9b\xbd\xe5\x88\xb6\xe9\x80\xa0\xe7\x9a\x84\xe9\xa2\x86\xe5\x86\x9b\xe8\x80\x853\xe5\x90\x8d'

print bytestring(string1)
print bytestring(string2)

print chardet.detect(string1)
print chardet.detect(string2)

d6 d0 b9 fa d6 c6 d4 ec b5 c4 c1 ec be fc d5 df 33 c3 fb
e4 b8 ad e5 9b bd e5 88 b6 e9 80 a0 e7 9a 84 e9 a2 86 e5 86 9b e8 80 85 33 e5 90 8d
{'confidence': 0.99, 'encoding': 'GB2312'}
{'confidence': 0.99, 'encoding': 'utf-8'}


In [14]:
# Good decoding
print string1.decode('GB2312')
print string2.decode('utf-8')

中国制造的领军者3名
中国制造的领军者3名


In [15]:
# Bad decoding
try:
    print string1.decode('utf-8')
except Exception as e:
    print e
try:
    print string2.decode('GB2312')
except Exception as e:
    print e

'utf8' codec can't decode byte 0xd6 in position 0: invalid continuation byte
'gb2312' codec can't decode bytes in position 2-3: illegal multibyte sequence


In [16]:
# Ugly decoding
print string1.decode('utf-8', 'replace')
print string2.decode('gb2312', 'replace')

�й�����������3��
涓���堕�������3��
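
Rather than accepting the replacement characters above, a program can try a short list of candidate encodings strictly and keep the first one that decodes without error. The following is a minimal sketch under that assumption; the function name `decode_with_fallback` and the candidate list are hypothetical, and the code is intended to run on both Python 2 and Python 3.

```python
def decode_with_fallback(data, encodings=("utf-8", "gb2312")):
    """Return (text, encoding) for the first candidate that decodes cleanly."""
    for encoding in encodings:
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            continue  # strict decoding failed, try the next candidate
    raise ValueError("none of the candidate encodings matched")

string1 = b'\xd6\xd0\xb9\xfa\xd6\xc6\xd4\xec\xb5\xc4\xc1\xec\xbe\xfc\xd5\xdf3\xc3\xfb'
string2 = (b'\xe4\xb8\xad\xe5\x9b\xbd\xe5\x88\xb6\xe9\x80\xa0\xe7\x9a\x84'
           b'\xe9\xa2\x86\xe5\x86\x9b\xe8\x80\x853\xe5\x90\x8d')

text1, enc1 = decode_with_fallback(string1)
text2, enc2 = decode_with_fallback(string2)

print("%s %s" % (enc1, enc2))
print(text1 == text2)
```

UTF-8 is listed first because its strict decoder rejects most text written in other encodings, whereas many single-byte codecs accept almost any input and would mask the real encoding.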


In [17]:
string = '\xdd\xea\xee\xed\xee\xec\xe8\xea\xe0 \xe8 \xf4\xe8\xed\xe0\xed\xf1\xfb'
print chardet.detect(string)

{'confidence': 0.99, 'encoding': 'MacCyrillic'}


In [18]:
print string.decode("MacCyrillic")
print string.decode("cp1251")

Ёкономика и финансы
Экономика и финансы
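
The detector's guess is a statistical one, and here two codecs accept the same bytes. A rough way to see how much is at stake is to decode with each candidate and list the positions where the results disagree. This is only a sketch: the candidate list is an assumption chosen to match the cell above, and `mac_cyrillic` is the standard-library codec name for the encoding the detector reports as MacCyrillic.

```python
data = b'\xdd\xea\xee\xed\xee\xec\xe8\xea\xe0 \xe8 \xf4\xe8\xed\xe0\xed\xf1\xfb'

# Assumed candidates: the detector's guess and the alternative tried above.
candidates = ["mac_cyrillic", "cp1251"]
decoded = dict((codec, data.decode(codec)) for codec in candidates)

# Positions where the two decodings produce different characters.
pairs = zip(decoded["mac_cyrillic"], decoded["cp1251"])
differing = [i for i, (x, y) in enumerate(pairs) if x != y]

print(differing)
```

For this input only one character differs between the two decodings, so a byte-level detector has very little evidence to go on and a human reader (or a dictionary check) has to make the final call.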


Gotcha: Normalization

In [19]:
# s1 uses the precomposed character U+00E9; s2 uses "e" followed by the
# combining acute accent U+0301. They render identically.
s1 = u"café"
s2 = u"café"
print len(s1), len(s2)
print s1 == s2

4 5
False


In [20]:
print utf8(s1)
print utf8(s2)
print repr((s1, s2))

63 61 66 c3 a9
63 61 66 65 cc 81
(u'caf\xe9', u'cafe\u0301')
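
The two spellings compare equal once both are converted to the same Unicode normalization form. A minimal sketch using the standard library's `unicodedata.normalize` follows; the strings are written with escapes so the difference stays visible, and choosing NFC over NFD here is an illustrative assumption rather than a recommendation from the original notebook.

```python
import unicodedata

s1 = u"caf\xe9"      # precomposed: U+00E9
s2 = u"cafe\u0301"   # decomposed: "e" + combining acute accent U+0301

# NFC composes characters where possible; NFD decomposes them.
nfc1 = unicodedata.normalize("NFC", s1)
nfc2 = unicodedata.normalize("NFC", s2)
nfd1 = unicodedata.normalize("NFD", s1)

print(s1 == s2)      # raw comparison of the two spellings
print(nfc1 == nfc2)  # comparison after normalizing both to NFC
print(len(nfc2))     # length of the composed form
print(len(nfd1))     # length of the decomposed form
```

Normalizing text once at the point where it enters the program keeps later comparisons, dictionary lookups, and length checks consistent.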


Handling Unicode in the Standard Library

In [21]:
import os
def find(path):
    for dirpath, dirnames, filenames in os.walk(path):
        for filename in filenames:
            print repr(os.path.join(dirpath, filename))

In [22]:
find('.')

'./Helpers.ipynb'
'./Python 2 Strings.ipynb'
'./Python 3 Strings.ipynb'
'./.ipynb_checkpoints/Helpers-checkpoint.ipynb'
'./.ipynb_checkpoints/Python 2 Strings-checkpoint.ipynb'
'./.ipynb_checkpoints/Python 3 Strings-checkpoint.ipynb'


In [23]:
find(u'.')

u'./Helpers.ipynb'
u'./Python 2 Strings.ipynb'
u'./Python 3 Strings.ipynb'
u'./.ipynb_checkpoints/Helpers-checkpoint.ipynb'
u'./.ipynb_checkpoints/Python 2 Strings-checkpoint.ipynb'
u'./.ipynb_checkpoints/Python 3 Strings-checkpoint.ipynb'


Handling Unicode in Python 2

In [24]:
# Uncommenting the import below makes unprefixed string literals unicode.
# Once it has run, every cell executed afterwards in this kernel is affected.
#from __future__ import unicode_literals
type("this string has no 'u' prefix")

str