# Follow best practices
___

* Only use UTF-8 strings internally (i.e. in the code itself)
* Try to stick to Python 3
    * Python 3's str type holds Unicode text (yay!)
    * Python 2's str type holds raw bytes and assumes ASCII by default (boo!)
* If using anything other than UTF-8:
    * Decode text as soon as you read it in
    * Re-Encode at the last possible moment
* Try to avoid changing encodings a lot   
    
For more information, see [the Python section of *Programming with Unicode*, by Victor Stinner](http://unicodebook.readthedocs.io/programming_languages.html#python) and the [Unicode HOWTO section of the Python 3 docs](https://docs.python.org/release/3.0.1/howto/unicode.html#python-s-unicode-support).

In [1]:
# example of decoding & re-encoding

# read in the file (open() decodes it from Big5 into Unicode text for us)
with open("../input/yan_BIG-5.txt", encoding="big5") as f:
    # read whole lines until we have roughly 5000 characters (the argument is a size hint)
    lines = f.readlines(5000)

# check out the last line
last_line = lines[-1]
print("In unicode: ", last_line)

# write out just the last line in the original encoding
# make sure you open the file in binary mode (the "b" in "wb")
with open("big5_output.txt", "wb") as f:
    # convert back to big5 as we write out our file
    f.write(last_line.encode("big5"))

# take a look to see how the encoding changes our file
print("In BIG-5: ", last_line.encode("big5"))

In unicode:      《家語》曰：「君子不博，為其兼行惡道故也。」《論語》云：「不

In BIG-5:  b'    \xa1m\xaea\xbby\xa1n\xa4\xea\xa1G\xa1u\xa7g\xa4l\xa4\xa3\xb3\xd5\xa1A\xac\xb0\xa8\xe4\xad\xdd\xa6\xe6\xb4c\xb9D\xacG\xa4]\xa1C\xa1v\xa1m\xbd\xd7\xbby\xa1n\xa4\xaa\xa1G\xa1u\xa4\xa3\n'
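
The same decode-early, encode-late pattern can be wrapped in a small helper that converts a file to UTF-8. This is only a sketch: `convert_to_utf8` and the sample file names are hypothetical, and the demo writes its own throwaway Big5 file so that it runs without the notebook's input data.

```python
import os
import tempfile


def convert_to_utf8(src_path, dst_path, src_encoding):
    """Decode src_path from src_encoding and save it again as UTF-8 (hypothetical helper)."""
    # Decode as soon as we read: text mode with an explicit encoding gives us a str.
    with open(src_path, encoding=src_encoding) as src:
        text = src.read()
    # Encode at the last possible moment: only when writing back to disk.
    with open(dst_path, "w", encoding="utf-8") as dst:
        dst.write(text)
    return text


# Demo on a temporary Big5 file (the file names are made up for this example).
with tempfile.TemporaryDirectory() as tmp:
    src_path = os.path.join(tmp, "sample_big5.txt")
    dst_path = os.path.join(tmp, "sample_utf8.txt")
    with open(src_path, "wb") as f:
        f.write("君子不博\n".encode("big5"))

    convert_to_utf8(src_path, dst_path, "big5")

    # Read the result back as raw bytes to confirm what ended up on disk.
    with open(dst_path, "rb") as f:
        utf8_bytes = f.read()

print(utf8_bytes.decode("utf-8"))
```

Everything between the read and the write works on a plain str, so nothing in the middle has to know which encoding the file came from.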


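Why is it such a big deal to only use UTF-8? Because basic string manipulation functions assume you're going to pass them properly decoded text and react accordingly. For example, len() counts characters on a decoded string, but it counts bytes once that string has been encoded.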

In [2]:
print(last_line)
print() # print a blank line
print("Actual length:", len(last_line))
print("Length with wrong encoding:", len(last_line.encode("big5")))

    《家語》曰：「君子不博，為其兼行惡道故也。」《論語》云：「不


Actual length: 35
Length with wrong encoding: 65
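
Length is not the only thing that goes wrong. As a small illustration (using four characters from the sample line above, and assuming nothing beyond the standard library), slicing the encoded bytes can cut a two-byte Big5 character in half, while slicing the decoded string always keeps whole characters.

```python
text = "君子不博"            # four characters
raw = text.encode("big5")   # eight bytes: each of these characters takes two bytes in Big5

# Slicing the str keeps whole characters.
print(text[:3])

# Slicing the bytes can stop in the middle of a character,
# and the leftover half no longer decodes.
try:
    raw[:3].decode("big5")
except UnicodeDecodeError as err:
    print("Could not decode the truncated bytes:", err.reason)
```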


You also don't want to go around changing character encodings willy-nilly. If the target encoding can't represent some of your characters and you tell Python to replace them rather than raise an error, they are swapped for the placeholder used for unknown characters. When you then convert back to UTF-8, you get the placeholder back, not the text you started with. The original information is gone and you won't be able to recover it (especially if you're modifying files in place).

In [3]:
# start with a string
before = "€"

# encode it to a different encoding, replacing characters that raise errors
after = before.encode("big5", errors = "replace")

# decode it back into a Unicode string
print(after.decode("big5"))

# We've lost the original underlying byte string! It's been 
# replaced with the underlying byte string for the unknown character :(

?
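
One way to catch this before any data is lost is to encode with the default strict error handling and check that the text survives a round trip. The helper below is a hypothetical sketch, not part of any library.

```python
def survives_round_trip(text, encoding):
    """Return True if text can be encoded to `encoding` and decoded back unchanged."""
    try:
        # errors="strict" is the default; it is spelled out here to contrast with "replace".
        encoded = text.encode(encoding, errors="strict")
    except UnicodeEncodeError:
        # At least one character has no representation in this encoding.
        return False
    return encoded.decode(encoding) == text


print(survives_round_trip("€", "big5"))
print(survives_round_trip("€", "utf-8"))
print(survives_round_trip("君子不博", "big5"))
```

Running a check like this before converting a file in place makes the failure loud instead of silent.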


# Automatically guess character encodings
___

You can automatically guess the most likely character encoding for a file using the Python module chardet. (The documentation is [here](http://chardet.readthedocs.io/en/latest/), but note that the code examples are all in Python 2.) This won't *always* work, but it's a good start.

In [4]:
# import a library to detect encodings
import chardet
import glob

# for every text file, print the file name & guess its file encoding
print("File".ljust(45), "Encoding")
for filename in glob.glob('../input/*.txt'):
    with open(filename, 'rb') as rawdata:
        result = chardet.detect(rawdata.read())
    print(filename.ljust(45), result['encoding'])

File                                          Encoding
../input/shisei_UTF-8.txt                     UTF-8-SIG
../input/harpers_ASCII.txt                    ascii
../input/yan_BIG-5.txt                        Big5
../input/olaf_Windows-1251.txt                windows-1251
../input/portugal_ISO-8859-1.txt              ISO-8859-1
../input/die_ISO-8859-1.txt                   ISO-8859-1
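
When the guess turns out to be wrong, a different technique from the one chardet uses is to try a short list of candidate encodings with strict decoding and keep the first one that works. The sketch below uses only the standard library; `decode_with_candidates` and its default candidate list are assumptions made for illustration.

```python
def decode_with_candidates(raw, candidates=("utf-8", "big5", "windows-1251", "iso-8859-1")):
    """Return (text, encoding) for the first candidate that decodes raw without errors."""
    for encoding in candidates:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            # This candidate can't represent the bytes; move on to the next one.
            continue
    raise ValueError("none of the candidate encodings could decode the data")


sample = "君子不博".encode("big5")
text, encoding = decode_with_candidates(sample)
print(encoding, text)
```

Order matters here. ISO-8859-1 accepts any byte sequence, so it belongs at the end of the list, and a successful decode only means the bytes were valid in that encoding, not that the resulting text is correct.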


We can also use this to build a quick test to see if our files are in UTF-8.

In [5]:
# function to test if a file is in UTF-8
def is_it_unicode(filename):
    with open(filename, 'rb') as f:
        encoding_info = chardet.detect(f.read())
    # chardet reports None for the encoding when it can't make a guess
    encoding = encoding_info['encoding'] or "unknown"
    if "UTF-8" not in encoding.upper():
        print("This isn't UTF-8! It's", encoding)
    else:
        print("Yep, it's UTF-8!")

# test our function, the first one is not UTF-8, the second one is!
is_it_unicode("../input/die_ISO-8859-1.txt")
is_it_unicode("../input/shisei_UTF-8.txt")

This isn't UTF-8! It's ISO-8859-1
Yep, it's UTF-8!
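
The function above relies on a guess. If you only need a yes or no answer for UTF-8, a stricter check is to try decoding the bytes and see whether Python raises an error. This is a minimal standard-library sketch with hypothetical helper names. The second helper looks for the byte order mark that some UTF-8 files start with, which is why the first file in the table above was reported as UTF-8-SIG.

```python
import codecs


def is_valid_utf8(raw):
    """Return True if the bytes decode cleanly as UTF-8 (a check, not a guess)."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def has_utf8_bom(raw):
    """Return True if the bytes start with the UTF-8 byte order mark."""
    return raw.startswith(codecs.BOM_UTF8)


german = "Grüße"
print(is_valid_utf8(german.encode("utf-8")))
print(is_valid_utf8(german.encode("iso-8859-1")))
print(has_utf8_bom(german.encode("utf-8-sig")))
```

Pure ASCII files pass this check as well, because ASCII is a subset of UTF-8.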


# Ungarble your Unicode
____

Sometimes you'll end up with valid Unicode that just has some specific garbled characters in it, especially if it's text that's been copied and pasted back and forth.

These examples are from the [ftfy module documentation](https://ftfy.readthedocs.io/en/latest/).

In [6]:
# import the "fixed that for you" module
import ftfy

# use ftfy to guess what the underlying unicode should be
print(ftfy.fix_text("The puppyÃ¢â‚¬â„¢s paws were huge."))

The puppy's paws were huge.


In [7]:
# use ftfy to guess what the underlying unicode should be
print(ftfy.fix_text("&macr;\\_(ã\x83\x84)_/&macr;"))

¯\_(ツ)_/¯
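
To see where this kind of garbling can come from, here is a sketch that reproduces a single round of it with the standard library alone: UTF-8 bytes decoded with the wrong codec (Windows-1252 is assumed here). It illustrates the kind of mistake ftfy is meant to undo, not how ftfy works internally.

```python
original = "The puppy’s paws were huge."

# Encode as UTF-8, then decode the same bytes with the wrong codec.
garbled = original.encode("utf-8").decode("windows-1252")
print(garbled)

# Reversing that exact mistake restores the text.
repaired = garbled.encode("windows-1252").decode("utf-8")
print(repaired)
```

A manual reversal like this only works when you already know which wrong codec was applied and how many times. Guessing that is the part ftfy handles for you.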
