# Encoding

Goals:
    - A string is more than a sequence of bytes
    - A string is a pair (bytes, encoding)
    - Use unicode_literals in Python 2
    - Manage differently encoded filenames

Modules:

In [9]:
import os
import os.path
import glob

basedir = "/tmp/course"
# Create the working directory if it does not exist yet
if not os.path.isdir(basedir):
    os.makedirs(basedir)


In [4]:
# An encoding is a map between text and bytes

# Py3 doesn't need the 'u' prefix
the_string = u"S\u00fcd" # Sued
print(the_string)

Süd


In [None]:
# can be encoded in different...
in_utf8 = the_string.encode('utf-8')
in_win = the_string.encode('cp1252')

# ...byte-sequences
assert isinstance(in_utf8, bytes)
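To make the difference between the two encodings visible, here is a minimal sketch that inspects the bytes produced above. It reuses the names from the cell and is written for Python 3.

```python
the_string = u"S\u00fcd"  # three characters: S, u-umlaut, d

in_utf8 = the_string.encode("utf-8")
in_win = the_string.encode("cp1252")

# Same text, two different byte sequences of different lengths.
print(repr(in_utf8))  # b'S\xc3\xbcd'
print(repr(in_win))   # b'S\xfcd'
print(len(the_string), len(in_utf8), len(in_win))  # 3 4 3

# Decoding with the matching encoding restores the original text.
assert in_utf8.decode("utf-8") == the_string
assert in_win.decode("cp1252") == the_string
```

The text has three characters in both cases. Only the byte representation changes with the encoding.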

In [1]:
# Decoding bytes using the wrong map...
# ...gives garbled results such as 'SÃ¼d'
print(in_utf8.decode('cp1252'))

SÃ¼d
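When the wrong decode step was lossless, the garbling can usually be undone by reversing it. The sketch below assumes the text was UTF-8 read as cp1252. `repair_mojibake` is a hypothetical helper name, not part of the course module.

```python
def repair_mojibake(garbled, wrong="cp1252", right="utf-8"):
    """Undo a decode done with `wrong` when the bytes were really `right`."""
    # Re-encode with the wrong map to get the original bytes back,
    # then decode them with the map that should have been used.
    return garbled.encode(wrong).decode(right)

# Reproduce the mistake: UTF-8 bytes decoded with the cp1252 map.
garbled = b"S\xc3\xbcd".decode("cp1252")
repaired = repair_mojibake(garbled)

print(repaired == u"S\u00fcd")  # True
```

This only works if we know, or guess correctly, which two encodings were mixed up. It fails when the wrong decode dropped or replaced bytes.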


In [15]:
# Filenames are actually binary data:
#  we should be careful when our scripts read them,
#  e.g. from a vfat filesystem.

# To make Py2 encoding-aware we must import
from __future__ import unicode_literals, print_function

# Create 3 windows-encoded filenames in basedir
# using the provided function
from course import create_espana
create_espana(basedir)


In [17]:
# Just list the newly created files
# and check that they are not showing correctly (unless we have windows :D)
!dir {basedir}

Espa\351a.0.txt  Espa\351a.1.txt  Espa\351a.2.txt
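The listing shows the undisplayable byte as the octal escape `\351`. A small sketch, for Python 3, converts it to the hexadecimal form that Python uses in its own messages and looks the byte up in cp1252:

```python
octal_escape = "351"               # taken from the listing: Espa\351a
byte_value = int(octal_escape, 8)  # octal -> integer

print(hex(byte_value))             # 0xe9

# Look the single byte up in the cp1252 map.
raw = bytes(bytearray([byte_value]))
char = raw.decode("cp1252")
print("U+%04X" % ord(char))        # U+00E9
```

U+00E9 is 'é'. In cp1252 'ñ' is byte 0xf1, which would be listed as `\361`.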


In [18]:
from glob import glob as ls 
#expands wildcards like ls

# To avoid encoding issues like the following...
files = ls("/tmp/course/*.txt")

# UnicodeDecodeError: 'ascii' codec can't decode
# byte 0xe9 in position 5:
# ordinal not in range(128)
# Note: in cp1252 the byte 0xe9 is 'é' ('ñ' would be 0xf1)

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe9 in position 5: ordinal not in range(128)
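The error presumably comes from Py2 mixing a text pattern with byte filenames, which triggers an implicit decode with the 'ascii' codec. The sketch below reproduces the same failure explicitly in Python 3. `raw_name` is a made-up sample, so the reported position differs from the one above.

```python
raw_name = b"Espa\xe9a.0.txt"  # hypothetical sample with one non-ASCII byte

error = None
try:
    # ASCII only covers bytes 0..127, so 0xe9 cannot be decoded.
    raw_name.decode("ascii")
except UnicodeDecodeError as exc:
    error = exc
    # The exception records which codec failed, where, and why.
    print(exc.encoding, hex(raw_name[exc.start]), exc.start, exc.reason)
```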

In [19]:
# We must explicitly use bytes
files = ls(b"/tmp/course/*.txt")
print(files)

['/tmp/course/Espa\xe9a.0.txt', '/tmp/course/Espa\xe9a.1.txt', '/tmp/course/Espa\xe9a.2.txt']
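Once the names are available as bytes, they still have to be turned into text for display. A minimal sketch, assuming the names are either UTF-8 or cp1252, tries the candidates in order. `decode_filename` and the sample names are hypothetical and not part of the course module.

```python
def decode_filename(raw, candidates=("utf-8", "cp1252")):
    """Return (text, encoding) for the first candidate that decodes `raw`."""
    for encoding in candidates:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue  # not valid in this encoding, try the next one
    raise ValueError("no candidate encoding fits %r" % (raw,))

# One windows-encoded name and one UTF-8 name.
names = [b"Espa\xe9a.0.txt", b"S\xc3\xbcd.txt"]
decoded = [decode_filename(name) for name in names]

for text, encoding in decoded:
    # ascii() keeps the output printable on any terminal
    print(encoding, ascii(text))
```

UTF-8 goes first because it rejects most byte sequences that were not written as UTF-8, while cp1252 accepts almost anything. The result is a guess, not a proof of the original encoding.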
