---
Hacking Unicode
====

![](images/chacter_encoding.png)

By The End Of This Session You Should Be Able To:
----

- Fix Unicode encoding errors
- Recognize mojibake
- Describe what a Unicode sandwich is
- Make your way out of Unicode Hell (I'm your Virgil)

Remember that earlier in the course we loaded data from a local file:

In [4]:
with open('../../corpora/shakespeare_all.txt', encoding='utf-8') as f:
    shakespeare = f.read()

In [5]:
shakespeare[:50]

'\ufeffTHE SONNETS\nby William Shakespeare\n\n\n\n           '

![](http://replygif.net/thumbnail/100.gif)
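The leading `'\ufeff'` in the output above is a byte order mark (BOM) that some editors write at the start of UTF-8 files. As a minimal sketch, assuming a throwaway temporary file that stands in for the corpus, the standard `utf-8-sig` codec drops the BOM on read while plain `utf-8` keeps it:

```python
import os
import tempfile

# Write a small UTF-8 file that starts with a BOM (bytes EF BB BF).
# The file and its contents are stand-ins, not the course corpus.
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
    tmp.write(b"\xef\xbb\xbfTHE SONNETS\n")
    path = tmp.name

# Plain UTF-8 keeps the BOM as the character U+FEFF.
with open(path, encoding="utf-8") as f:
    with_bom = f.read()

# utf-8-sig strips a leading BOM if one is present.
with open(path, encoding="utf-8-sig") as f:
    without_bom = f.read()

os.remove(path)

print(repr(with_bom))     # '\ufeffTHE SONNETS\n'
print(repr(without_bom))  # 'THE SONNETS\n'
```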

---
Mojibake
---

![](http://mihai-nita.net/wp-content/uploads/2006/08/Rez4.gif)

What is mojibake?
---

![](http://s3.media.squarespace.com/production/920827/11462743/_wGr8njEWjtI/S_4ELCN3k7I/AAAAAAAAI3g/0wqbEFX0yjw/s1600/julian%2Bn.ow.thai%2Bfont%2B%2528story%2529.jpg)

Mojibake is the garbled, unreadable text that appears when software fails to display characters correctly.

Why is there mojibake?
-----

It is a result of text being decoded using an unintended character encoding.
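A minimal sketch of how this happens, using only the standard library. The sample word and the choice of Latin-1 as the "unintended" encoding are picked purely for illustration:

```python
original = "café"

# Encode to bytes with UTF-8: 'é' becomes the two bytes 0xC3 0xA9.
utf8_bytes = original.encode("utf-8")

# Decode those bytes with the wrong codec: each byte turns into its own character.
garbled = utf8_bytes.decode("latin-1")

print(utf8_bytes)  # b'caf\xc3\xa9'
print(garbled)     # cafÃ©
```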

[Very common in Japanese websites](https://en.wikipedia.org/wiki/Mojibake), hence the name:  
文字 (moji) "character" + 化け (bake) "transform"

__Bad News__: Looks awful
    
__Good News__: It is systematic, so once you find the right encoding it is usually easy to fix
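Because the damage is systematic, it can often be reversed by replaying the mistake backwards: re-encode with the encoding that was wrongly used, then decode as the one that was intended. The helper below is a hypothetical sketch (the name `undo_mojibake` and the candidate list are assumptions), and it only covers text that was really UTF-8:

```python
def undo_mojibake(garbled, wrong_encodings=("latin-1", "cp1252")):
    """Try to reverse UTF-8 text that was decoded with the wrong codec.

    Returns the input unchanged if no candidate encoding round-trips.
    """
    for encoding in wrong_encodings:
        try:
            # Recover the original bytes, then decode them as UTF-8.
            return garbled.encode(encoding).decode("utf-8")
        except (UnicodeEncodeError, UnicodeDecodeError):
            continue
    return garbled


print(undo_mojibake("cafÃ©"))  # café
print(undo_mojibake("hello"))  # hello
```

Real-world text can be garbled more than once or mixed with clean text, which is where a dedicated library earns its keep.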

In [1]:
import ftfy

In [8]:
ftfy.fix_text(shakespeare[:50])

'THE SONNETS\nby William Shakespeare\n\n\n\n           '

![](http://s2.quickmeme.com/img/0a/0ac84ebf42410c7c7325f8f7120723bd4ccede374902ab5c18000ab323a85a6b.jpg)

[ftfy package](https://github.com/LuminosoInsight/python-ftfy/blob/master/README.md) 

It automagically fixes many common encoding errors, though not every one.

In [None]:
# Explore what else the package offers (or type "ftfy." and press Tab)
dir(ftfy)

What about the line breaks?

In [4]:
with open('../../corpora/shakespeare_all.txt', encoding='utf-8') as f:
    shakespeare = f.read().splitlines()

In [11]:
shakespeare[:10]

['\ufeffTHE SONNETS',
 'by William Shakespeare',
 '',
 '',
 '',
 '                     1',
 '  From fairest creatures we desire increase,',
 "  That thereby beauty's rose might never die,",
 '  But as the riper should by time decease,',
 '  His tender heir might bear his memory:']

In [9]:
# Munging text
shakespeare = [ftfy.fix_text(line.strip()) for line in shakespeare if line]

In [30]:
shakespeare[:10]

['THE SONNETS',
 'by William Shakespeare',
 '1',
 'From fairest creatures we desire increase,',
 "That thereby beauty's rose might never die,",
 'But as the riper should by time decease,',
 'His tender heir might bear his memory:',
 'But thou contracted to thine own bright eyes,',
 "Feed'st thy light's flame with self-substantial fuel,",
 'Making a famine where abundance lies,']

Point to Ponder
-----

<img src="http://www.quickmeme.com/img/ce/ce0e82f74fe1c1585ebdbdc2365bd9a69d222e16dcea6e95390136b2f1093a1f.jpg" style="width: 400px;"/>

Should munging be done on load or after?
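One possible answer, sketched under assumptions: munge while loading, so the rest of the program never sees raw lines. The generator name `iter_clean_lines` is hypothetical, an in-memory `io.StringIO` stands in for the corpus file, and a plain BOM strip replaces the `ftfy.fix_text` call to keep the sketch standard-library only:

```python
import io


def iter_clean_lines(handle):
    """Yield stripped, non-empty lines from an open text handle."""
    for raw_line in handle:
        # Remove a leading BOM first, then the surrounding whitespace.
        line = raw_line.lstrip("\ufeff").strip()
        if line:
            yield line


# Stand-in for the first few lines of the corpus.
sample = io.StringIO(
    "\ufeffTHE SONNETS\nby William Shakespeare\n\n\n                     1\n"
)
cleaned = list(iter_clean_lines(sample))
print(cleaned)  # ['THE SONNETS', 'by William Shakespeare', '1']
```

Cleaning on load keeps memory use low for large files. Cleaning afterwards keeps the raw text around in case the munging turns out to be too aggressive.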

---
Unicode Sandwich (only popular in Silicon Valley)
---

<img src="https://www.safaribooksonline.com/library/view/fluent-python/9781491946237/images/flup_0402.png.jpg" style="width: 400px;"/>

<img src="http://1.bp.blogspot.com/-m4BldtOr4gw/UvzKDSP_YNI/AAAAAAAABGI/GgfyUAQuaQU/s1600/UnicodeSandwich.PNG" style="width: 400px;"/>
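The sandwich pictured above can be sketched in a few lines: decode bytes as soon as they arrive, work only with `str` in the middle, and encode as late as possible on the way out. The function `shout` and the in-memory byte streams are stand-ins chosen for illustration:

```python
import io


def shout(text: str) -> str:
    """The filling: business logic that only ever touches str."""
    return text.upper()


# Bottom slice: bytes come in from the outside world and are decoded at once.
incoming = io.BytesIO("señor\n".encode("utf-8"))
text = incoming.read().decode("utf-8")

# Filling: everything in the middle is plain text.
result = shout(text)

# Top slice: encode only at the moment of output.
outgoing = io.BytesIO()
outgoing.write(result.encode("utf-8"))

print(repr(result))         # 'SEÑOR\n'
print(outgoing.getvalue())  # b'SE\xc3\x91OR\n'
```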

What is Unicode?
------

> Unicode provides a unique number for every character, no matter what the platform, no matter what the program, no matter what the language.

[Source](http://www.unicode.org/standard/WhatIsUnicode.html)
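That "unique number" is called a code point. As a small illustration (the sample characters are arbitrary), Python's built-in `ord` and `chr` convert between a character and its code point, and `unicodedata.name` looks up the official character name:

```python
import unicodedata

# Map each sample character to its code point.
code_points = {char: ord(char) for char in "Aé文"}

for char, number in code_points.items():
    # Code points are conventionally written as U+ followed by hex digits.
    print(f"U+{number:04X}", char, unicodedata.name(char))

# chr() goes the other way: from code point back to character.
print(chr(0x6587))  # 文
```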

![](images/unicode_support.png)

> Humans use text. Computers speak bytes.  
— Esther Nam and Travis Fischer  

_Character encoding and Unicode in Python_
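To make the text-versus-bytes distinction concrete, here is a short sketch with an arbitrary sample word. The same five characters turn into different byte sequences depending on the encoding:

```python
text = "naïve"

# One str, two encodings, two different byte sequences.
as_utf8 = text.encode("utf-8")
as_utf16 = text.encode("utf-16-le")

print(len(text))      # 5 characters
print(len(as_utf8))   # 6 bytes: 'ï' needs two bytes in UTF-8
print(len(as_utf16))  # 10 bytes: two bytes per character here
print(as_utf8)        # b'na\xc3\xafve'

# Decoding with the matching encoding gets the text back.
round_trip = as_utf8.decode("utf-8")
```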

Unicode Table
-----

In [5]:
from IPython.display import IFrame

IFrame("http://unicode-table.com/en/#0014",
      width=700,
      height=350)

----
Unicode: Do the best you can and then go home
----

<img src="http://imgs.xkcd.com/comics/unicode.png" style="width: 400px;"/>

-----
Summary
----

- Unicode is better than other options, but still kinda sux
- Always try to keep it Unicode
- Be explicit about encodings
- If you see mojibake, don't ╯°□°）╯︵ ┻━┻. ftfy
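As a closing sketch of "be explicit about encodings", the snippet below shows what happens when the stated encoding does not match the bytes. The sample bytes are made up for illustration:

```python
raw = b"caf\xe9"  # 'café' encoded as Latin-1, not UTF-8

# Strict decoding (the default) refuses to guess and raises an error.
try:
    raw.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# errors="replace" keeps going, marking the bad byte with U+FFFD.
replaced = raw.decode("utf-8", errors="replace")

# Naming the encoding that was actually used recovers the text.
correct = raw.decode("latin-1")

print(strict_failed)   # True
print(repr(replaced))  # 'caf\ufffd'
print(correct)         # café
```

A loud `UnicodeDecodeError` is usually preferable to silent replacement, since it points straight at the mismatch.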


---
Bonus
---


![](http://imgs.xkcd.com/comics/rtl.png)

[Explain xkcd](http://www.explainxkcd.com/wiki/index.php/RTL)
