# How to handle Unicode in Python 2?
Quoted from Philip Guo's [blog](http://www.pgbovine.net/unicode-python.htm):

>In Python 2, if you see a `str` object, convert it to a unicode object right away by calling `.decode('utf-8')`. Process all strings as `unicode` objects, not `str` objects. If you need to write a `unicode` object out to a file or database, first call `.encode('utf-8')` on it.
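
The advice above amounts to a "decode at input, encode at output" pattern. Here is a minimal sketch, assuming the incoming bytes are UTF-8. The variable names are illustrative only, and the explicit `b''` and `u''` prefixes are used so that the snippet should behave the same on Python 2.7 and Python 3:

```python
# Bytes as they might arrive from a file or a socket (UTF-8 is assumed).
raw = b'Friedrichstra\xc3\x9fe'

# 1. Decode once, at the boundary of the program.
text = raw.decode('utf-8')

# 2. Work on the unicode object: lengths and slices count characters, not bytes.
byte_count = len(raw)    # the sharp s occupies two bytes in UTF-8
char_count = len(text)   # but it is a single character
last_two = text[-2:]

# 3. Encode once, when the data leaves the program.
out = text.encode('utf-8')
```

Keeping the conversions at the edges means the middle of the program never has to guess which type it is holding.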

* This is a `str` (byte string). It can be written to a file as-is

In [1]:
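str_text = 'Friedrichstra\xc3\x9fe'  # UTF-8 encoded bytes: \xc3\x9f is ß
print(type(str_text))
with open("file1.txt", 'w') as f:  # `f` avoids shadowing the built-in `file`
    f.write(str_text)  # a str is already bytes, so it is written unchanged
    # Expect a file containing Friedrichstraße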


* This is `unicode` text. It must be __encoded__ before it is written to a file

In [None]:
unicode_text = u'Friedrichstra\xdfe'  # \xdf is the code point of ß
print(type(unicode_text))
with open("file2.txt", 'w') as f:
    f.write(unicode_text.encode("utf-8"))  # encode to bytes before writing
    # Expect a file containing Friedrichstraße

* Read a file that contains non-ASCII characters

In [None]:
with open("file2.txt", 'r') as f:
    for line in f:
        print(line.decode("utf-8"))  # decode the bytes to unicode right after reading
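
As an alternative to encoding and decoding by hand, `io.open` (available since Python 2.6) accepts an `encoding` argument and does both conversions itself. The following is only a sketch: the temporary directory and the file name `file3.txt` are made up for the demonstration.

```python
import io
import os
import shutil
import tempfile

unicode_text = u'Friedrichstra\xdfe'

# Hypothetical scratch location, used only for this demonstration.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, 'file3.txt')

# io.open encodes on write, so the unicode object is passed directly.
with io.open(path, 'w', encoding='utf-8') as f:
    f.write(unicode_text)

# It also decodes on read, so every line comes back as unicode.
with io.open(path, 'r', encoding='utf-8') as f:
    lines = [line.rstrip(u'\n') for line in f]

shutil.rmtree(tmp_dir)  # clean up the scratch directory
```

With this approach the rest of the code only ever sees `unicode` objects, which matches the rule quoted at the top.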

# How to convert from str to unicode? 
### Apply the conversions only in these directions:
* str     ====(decode)====> unicode
* unicode ====(encode)====> str

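The opposite calls raise an error when the text contains non-ASCII characters, because Python 2 first performs an implicit ASCII conversion:
* `str_text.encode("utf-8")` raises `UnicodeDecodeError`
* `unicode_text.decode("utf-8")` raises `UnicodeEncodeError`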
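
The reason these calls fail in Python 2 is a hidden ASCII step: `str.encode` first decodes the bytes with the ASCII codec, and `unicode.decode` first encodes the text with the ASCII codec. The sketch below makes that hidden step explicit, so it can be run on both Python 2.7 and Python 3; the names `decode_error` and `encode_error` are illustrative.

```python
str_text = b'Friedrichstra\xc3\x9fe'   # UTF-8 bytes (a plain str in Python 2)
unicode_text = u'Friedrichstra\xdfe'   # unicode object

# Hidden step behind str_text.encode('utf-8') in Python 2: an ASCII decode.
try:
    str_text.decode('ascii')
    decode_error = None
except UnicodeDecodeError as exc:
    decode_error = exc

# Hidden step behind unicode_text.decode('utf-8') in Python 2: an ASCII encode.
try:
    unicode_text.encode('ascii')
    encode_error = None
except UnicodeEncodeError as exc:
    encode_error = exc
```

Both attempts fail on the ß, which is why the error message mentions the `ascii` codec even though `utf-8` was requested.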

In [None]:
str_text.decode("utf-8") == unicode_text

In [None]:
str_text == unicode_text.encode("utf-8")

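* A membership check is also a form of comparison. If one operand is `str` and the other is `unicode`, Python 2 tries to decode the `str` as ASCII; with non-ASCII bytes this fails, a `UnicodeWarning` is emitted, and the result is False.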

In [None]:
unicode_text in [str_text]

* Good practice: convert both sides to `unicode`

In [None]:
unicode_text in [str_text.decode('utf-8')]
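
When a list may contain both types, normalizing every item before the check avoids the silent False. This is a minimal sketch: `ensure_unicode` is a hypothetical helper, not a standard-library function, and it assumes all byte strings are UTF-8.

```python
def ensure_unicode(value, encoding='utf-8'):
    """Return value as a unicode object, decoding byte strings if needed."""
    if isinstance(value, bytes):  # bytes is an alias of str in Python 2
        return value.decode(encoding)
    return value


str_text = b'Friedrichstra\xc3\x9fe'
unicode_text = u'Friedrichstra\xdfe'

# A list that mixes byte strings and unicode objects.
mixed = [str_text, b'Berlin', u'M\xfcnchen']
normalized = [ensure_unicode(item) for item in mixed]

found = unicode_text in normalized
```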

### The reverse direction works only if the text contains no non-ASCII characters

In [None]:
"abc".encode('utf-8')  # Works; str_text.encode('utf-8') would raise UnicodeDecodeError because of the non-ASCII ß

In [None]:
"abc".decode('utf-8')  # Works; unicode_text.decode('utf-8') would raise UnicodeEncodeError because of the non-ASCII ß

### With a mixture of both types

In [None]:
unicode_text in [unicode_text]

# Write unicode to CSV file 
[StackOverflow](https://stackoverflow.com/questions/17245415/read-and-write-csv-files-including-unicode-with-python-2-7)

In [None]:
import csv

tests = {'German': [u'Straße', u'auslösen', u'zerstören'],
         'French': [u'français', u'américaine', u'épais'],
         'Chinese': [u'中國的', u'英語', u'美國人']}

# In Python 2 the csv module works on bytes, so open the file in binary mode
with open('utf.csv', 'wb') as fout:
    writer = csv.writer(fout)
    writer.writerow(tests.keys())  # header row (the keys are ASCII-only)
    for row in zip(*tests.values()):
        # Encode every unicode cell to UTF-8 bytes before writing
        writer.writerow([s.encode('utf-8') for s in row])

with open('utf.csv', 'rb') as fin:
    reader = csv.reader(fin)
    for row in reader:
        # Decode every cell back to unicode right after reading
        cells = [s.decode('utf-8') for s in row]
        fmt = u'{:<15}' * len(cells)
        print(fmt.format(*cells))

# Using unicodecsv

In [None]:
import unicodecsv as csv

filename = "unicode.csv"

# Write to file (unicodecsv reads and writes bytes, so use binary mode)
with open(filename, 'wb') as f:
    w = csv.writer(f, encoding='utf-8')
    w.writerow([u'é', u'ñ', 'a'])

# Read from file
with open(filename, 'rb') as f:
    r = csv.reader(f, encoding='utf-8')
    print(next(r))


# Regular Expression with Unicode
* Python 2: The pattern `\w+` matches non-ASCII letters (e.g. é, ö, ü) only when the text is a `unicode` object and the `re.UNICODE` flag is set
* Python 3: `str` patterns are Unicode-aware by default, so no flag is needed.

In [None]:
import re
print(re.findall(r'\w+', 'abc def güi jkl'))  # str input: ü is not matched

print(re.findall(r'\w+', u'abc def güi jkl'))  # unicode input without the flag: ü is still not matched

# Activate re.UNICODE on a unicode string
print(re.findall(r'\w+', u'abc def güi jkl', re.UNICODE))  # güi is matched as a whole
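
When the text arrives as bytes, decoding it before matching gives the same result on both Python versions. A minimal sketch, assuming the input is UTF-8; the variable names are illustrative:

```python
import re

raw = b'abc def g\xc3\xbci jkl'  # UTF-8 bytes; \xc3\xbc encodes the ü
text = raw.decode('utf-8')       # decode first, then match

# re.UNICODE is required in Python 2. In Python 3 it is already the
# default for str patterns, so passing it there is harmless.
words = re.findall(r'\w+', text, re.UNICODE)
```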

# More in depth about Unicode
Unicode HOWTO [Python Official Guide](https://docs.python.org/2/howto/unicode.html)
* Treatment of special conversion functions such as `str`, `unicode`, `chr`