## Encoding and Decoding when working with Text

The way computers interpret text is different from the way we do. We can simply read text as it is and recognize what it means. Computers do things a little differently: instead of storing actual "letters", "numbers", or any other kind of data directly, they represent all real-world data as *bits*, 0's and 1's.

Now the way a "number" is represented in a computer's memory is as a binary number, a number expressed as a combination of 0's and 1's.

For example:
![](../binary_number_system.png)

But binary numbers can become very long, so we can shorten them by converting them to hexadecimal using the table below. Starting from the right, every group of 4 bits is converted to 1 hexadecimal digit.

![](../hexadecimal-number-chart.jpg)

For the example above, the hexadecimal equivalent of the binary number `11101110` is `EE`.

Thus, `238` (decimal) = `11101110` (binary) = `EE` (hexadecimal) using the systems described above.

In a computer, 8 bits (binary digits) = 1 byte.
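
As a quick check of the conversions above, Python's built-in `int` and `format` can move between the three notations. This is only a small sketch, and the variable names are illustrative.

```python
# Convert between binary, decimal, and hexadecimal with built-ins.
bits = '11101110'                 # an 8-bit pattern, i.e. one byte

value = int(bits, 2)              # parse the string as a base-2 number
hex_digits = format(value, 'X')   # uppercase hexadecimal, no '0x' prefix
padded = format(value, '08b')     # back to binary, zero-padded to 8 bits

print(value, hex_digits, padded)  # 238 EE 11101110

# Each hexadecimal digit stands for one group of 4 bits.
print(format(0b1110, 'X'))        # E
```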

### Encoding and Decoding

There are numerous ways to convert "letters" to *bits*; this conversion is called **encoding**. Different computers use different encoding methods because of differences in operating system, language, etc.

In the past, [American Standard Code for Information Interchange](http://www.asciitable.com) (ASCII) was the standard character encoding across computers. But the number of characters ASCII can encode is very limited. Standard ASCII defines only 128 characters; if you clicked on the link, you'll see that even with the extended set there are only 256. This means that we would have no way of converting words in most other languages into *bits*. This is why many more character encoding systems have been developed, many of them based on ASCII, to be able to encode more characters. The one character encoding system we use in this class is `UTF-8`, which can handle many more characters.

**Encoding** takes in a string and converts each of its characters, using some encoding system, into computer data, 0's and 1's. Recall that 8 bits make up one byte, the basic unit of computer memory.

**Decoding** is the reverse of encoding. Decoding takes in computer data, 0's and 1's, and converts it to text that we are able to interpret. When decoding, we must specify a character encoding system in order to decode the computer data correctly. Otherwise, we may get garbage text that we cannot understand (example below).
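
To make the "garbage text" point concrete, here is a minimal sketch that decodes the same bytes twice: once with the encoding that produced them, and once with a different one. `Latin-1` is picked here only as an illustrative mismatch.

```python
text = 'première'
data = text.encode('utf-8')       # text -> bytes

right = data.decode('utf-8')      # same encoding: the round trip succeeds
wrong = data.decode('latin-1')    # different encoding: no error, but garbled

print(right)   # première
print(wrong)   # premiÃ¨re
```

Note that the mismatched decode does not raise an error. It silently produces the wrong characters, which is why the problem is easy to miss.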

Imagine you want to pass a note to your classmate, but you don't want other people to know what you're writing. You develop some kind of encoding system where each "letter" maps to some number, and you pass this encoded note to your friend. But in order for your friend to decode the note, you must also tell him or her what encoding system you used. This may sound very inefficient if we were to do it by hand, but it is very efficient for a computer.

$$\text{text} \xrightarrow{\text{encoding}} \text{computer data}$$

$$\text{text} \xleftarrow{\text{decoding}} \text{computer data}$$

More details and examples if you're interested: http://kunststube.net/encoding/

### 1. Encoding

Let's try to encode some text using the ASCII character encoding system first and see how limited it is.

In [1]:
string = 'Hello World!'
string_encoded_ASCII = string.encode('ASCII')
string_encoded_ASCII

b'Hello World!'

You can see that if we encode a string, the output looks like the same string except with a `b` in front. The `b` means that what follows is a *bytes* object rather than a string. You don't really have to understand the concept of bytes types for this lecture.

Now let's try to encode something other than standard English characters using ASCII.

In [2]:
string2 = 'première'
string2.encode('ASCII')

UnicodeEncodeError: 'ascii' codec can't encode character '\xe8' in position 5: ordinal not in range(128)

You should get an error from the code above. This is because characters like "è" don't exist in the ASCII character encoding system. This is why ASCII is very limited.
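
If you want to see exactly which character caused the failure, or to encode anyway at the cost of losing information, the sketch below shows one way. It relies on the `errors` argument of `str.encode`; the variable names are illustrative.

```python
text = 'première'

try:
    text.encode('ascii')
except UnicodeEncodeError as err:
    # err.object[err.start:err.end] is the part that could not be encoded
    bad = err.object[err.start:err.end]
    print('cannot encode:', bad)   # cannot encode: è

# Lossy alternatives: substitute or drop characters ASCII cannot represent.
replaced = text.encode('ascii', errors='replace')
ignored = text.encode('ascii', errors='ignore')
print(replaced)   # b'premi?re'
print(ignored)    # b'premire'
```

Both alternatives throw information away, so they cannot be decoded back to the original text.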

Try again with some text in a different language, like Spanish, Chinese, German, etc., that contains characters outside the standard 26 letters of the English alphabet. You should get the same error.

In [None]:
# Exercise:
string3 = ... # <- fill me in
string3.encode('ASCII')

Now let's try using the `UTF-8` encoding system.

In [3]:
string = 'Hello World!'

In [4]:
string_encoded = string.encode('UTF-8')
string_encoded

b'Hello World!'

In [5]:
type(string_encoded)

bytes
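
For the curious, a *bytes* object behaves like a sequence of integers between 0 and 255. The short sketch below, with illustrative names, shows this.

```python
data = 'Hi'.encode('utf-8')

print(len(data))         # 2
print(data[0])           # 72, indexing a bytes object gives an int
print(list(data))        # [72, 105]
print(bytes([72, 105]))  # b'Hi', built back from the integers
```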

In [6]:
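def byteToBinary(byte):
    # Convert each byte to its binary digits. bin() adds a '0b' prefix,
    # which [2:] strips off. Leading zeros are not kept.
    binary = []
    for b in byte:
        binary.append(bin(b)[2:])
    return binary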



In [7]:
string2 = 'première'
string2_encoded = string2.encode('UTF-8')
string2_encoded

b'premi\xc3\xa8re'

Using the `UTF-8` encoding system, you should see that the standard English characters remain the same, but the special character "è" got converted to two escape sequences of the form `\x..`. The 2 characters in place of the `..` are hexadecimal digits, which, if you recall from above, are equivalent to 8 bits, that is, one byte. Standard characters remain the same because `UTF-8` was designed to be backward compatible with ASCII, the standard for computers in the past.
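
How many bytes a character needs in `UTF-8` depends on the character. The sketch below counts the bytes for a few sample characters; the samples are chosen only for illustration.

```python
# Number of UTF-8 bytes needed for a few sample characters.
samples = ['e', 'è', '中', '\U0001F600']   # the last one is an emoji

sizes = {}
for ch in samples:
    encoded = ch.encode('utf-8')
    sizes[ch] = len(encoded)
    print(repr(ch), len(encoded), encoded.hex())
```

ASCII characters take 1 byte each, while the other samples here take 2, 3, and 4 bytes.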

Let's try to encode the foreign language text you had above using `UTF-8` and see what we get.

In [None]:
#Exercise
string3_encoded = ... # <- fill me in
string3_encoded

Below is a function that converts the *bytes* to binary that we've seen in the example at the top.

In [8]:
def bytesToBinary(byte):
    binary = []
    for b in byte:
        binary.append("{0:08b}".format(b))
    return binary

binary = bytesToBinary(string_encoded)
binary

['01001000',
 '01100101',
 '01101100',
 '01101100',
 '01101111',
 '00100000',
 '01010111',
 '01101111',
 '01110010',
 '01101100',
 '01100100',
 '00100001']

Above is a list of binary numbers that represents each character in `'Hello World!'`. Below is a better representation of the encoding.

In [10]:
import pandas
df = pandas.DataFrame()
df['string'] = list(string)
df['binary'] = bytesToBinary(string_encoded)
df['integer'] = [b for b in string_encoded]
df['hexidecimal'] = [hex(int(b, 2))[2:] for b in binary]
df

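|    | string | binary   | integer | hexidecimal |
|----|--------|----------|---------|-------------|
| 0  | H      | 01001000 | 72      | 48          |
| 1  | e      | 01100101 | 101     | 65          |
| 2  | l      | 01101100 | 108     | 6c          |
| 3  | l      | 01101100 | 108     | 6c          |
| 4  | o      | 01101111 | 111     | 6f          |
| 5  | ` `    | 00100000 | 32      | 20          |
| 6  | W      | 01010111 | 87      | 57          |
| 7  | o      | 01101111 | 111     | 6f          |
| 8  | r      | 01110010 | 114     | 72          |
| 9  | l      | 01101100 | 108     | 6c          |
| 10 | d      | 01100100 | 100     | 64          |
| 11 | !      | 00100001 | 33      | 21          |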

### 2. Decoding

Let's look at how to decode some computer data.

In [11]:
print('Before decoding:', string2_encoded)

string2_decoded = string2_encoded.decode('UTF-8')
print('After decoding:' , string2_decoded)

Before decoding: b'premi\xc3\xa8re'
After decoding: première


As described above, decoding takes in an encoding system and uses it to convert the computer data, *bytes*, back into text that we can read and interpret.
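
Decoding can also fail outright when the bytes are not valid in the chosen encoding system. The sketch below shows the error, along with the lossy `errors='replace'` option of `bytes.decode`; the variable names are illustrative.

```python
data = b'premi\xc3\xa8re'   # UTF-8 bytes for 'première'

try:
    data.decode('ascii')
except UnicodeDecodeError as err:
    position = err.start     # index of the first byte that could not be decoded
    print('cannot decode byte at position', position)   # 5

# errors='replace' swaps each undecodable byte for U+FFFD, the replacement character.
lossy = data.decode('ascii', errors='replace')
print(lossy)
```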

Try decoding the foreign language text you encoded before.

In [None]:
#Exercise:
print('Before decoding:', string3_encoded)

string3_decoded = ... # <- fill me in
print('After decoding:' , string3_decoded)

Try decoding the computer data below and see what the original string is. Before you actually code, can you tell what the original string is just by looking at raw data?

In [None]:
#Exercise:
computer_data = b'\x48\x65\x6c\x6c\x6f\x20\x57\x6f\x72\x6c\x64\x21'
computer_data_decoded = ...
computer_data_decoded

### 3. Encoding and Decoding Text

The examples and problems below mirror the "Misconceptions, Confusions And Problems" section of [this article](http://kunststube.net/encoding/).

At the beginning of the semester, we warned you not to open any text file using Microsoft Word. This is because Word, by default, decodes a text file using your operating system's default encoding system. If that default is different from the encoding system the text file was actually encoded in, you'll get garbage text that makes no sense.

On Mac OS, the default encoding system is `mac-roman`. On Windows, we treat the default encoding system as `Latin-1` (Western-language versions of Windows typically use the closely related `Windows-1252`).

Here is an example. Let's say you accidentally opened a text file using Word, and Word converted whatever you had originally to the string below. Let's try to fix this in Python.



In [None]:
string_corrupted = 'ÉGÉìÉRÅ[ÉfÉBÉìÉOÇÕìÔÇµÇ≠Ç»Ç¢'

Let's assume we opened this text file using Word on Mac OS. That means that whatever bytes the original file contained were decoded into the corrupted string above using the `mac-roman` encoding system. Thus, we must first reverse that step by encoding the corrupted string with `mac-roman`.

In [None]:
string_corrupted_encoded = string_corrupted.encode('mac-roman')
string_corrupted_encoded

We have now converted the corrupted string back into computer data using `mac-roman`. Next, we have to figure out what the original encoding system was in order to decode this computer data back into the original text. Let's try `UTF-8`.

In [None]:
string_corrupted_encoded.decode('UTF-8')

But notice how we're getting an error similar to the one we got when we tried to encode foreign text using ASCII. This pretty much means that `UTF-8` isn't the encoding system the text was originally encoded in.

The original text is actually "エンコーディングは難しくない" encoded in the Japanese Shift-JIS encoding. Without being told this, you would have had no easy way to retrieve the original text. But now that we know, can we retrieve the original text?

In [None]:
print('Corrupted string:', string_corrupted, '\n')
original = string_corrupted_encoded.decode('Shift-JIS')
print('Original sentence:', original)

What went wrong here exactly? Your text file was originally encoded with `Shift-JIS`. But since you tried to open it with Microsoft Word, Word by default used the `mac-roman` encoding system to decode your text file. This resulted in the corrupted string we see above.

In order to get back the original string, we must work backwards. We must first encode our corrupted string using `mac-roman` encoding system back into computer data. Once we have the corrupted string in computer data, we can then decode this data using the correct encoding system, `Shift-JIS` to get back the original Japanese text.
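
The two-step repair described above can be wrapped in a small function. This is a minimal sketch: `repair_mojibake` is a hypothetical helper name, and the corrupted string is rebuilt here from the known original rather than typed in by hand.

```python
def repair_mojibake(text, wrong_encoding, right_encoding):
    """Undo a decode done with the wrong encoding, then decode correctly."""
    return text.encode(wrong_encoding).decode(right_encoding)

original = 'エンコーディングは難しくない'

# Simulate the accident: Shift-JIS bytes decoded as mac-roman.
corrupted = original.encode('shift_jis').decode('mac_roman')

repaired = repair_mojibake(corrupted, 'mac_roman', 'shift_jis')
print(repaired == original)   # True

# Guessing the wrong original encoding fails, as in the UTF-8 attempt.
try:
    repair_mojibake(corrupted, 'mac_roman', 'utf-8')
    utf8_worked = True
except UnicodeDecodeError:
    utf8_worked = False
print(utf8_worked)            # False
```

The repair works only because every byte survives the wrong decoding step unchanged in value. If the corrupted text had been edited or saved with lossy substitutions, the original bytes could not be recovered this way.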

### Exercises

In the data folder, there are 3 files: `chinese_text.txt`, `chinese_text_mac.txt`, and `chinese_text_windows.txt`.
The last two files contain corrupted characters because I opened `chinese_text.txt` using **Microsoft Word** on the corresponding operating system. If you try to open `chinese_text.txt` on your computer using the default encoding system for your operating system, you should get the corresponding corrupted text.

Retrieve the original text.

The original text was encoded using `UTF-8`.

In [88]:
original = open('../data/chinese_text.txt', encoding='utf-8').read() # Don't worry about the encoding part here.
mac = open('../data/chinese_text_mac.txt', encoding='utf-8').read() # This is just to read in the exact text saved
windows = open('../data/chinese_text_windows.txt', encoding='utf-8').read() # in the text file.
#windows = original.encode('utf-8').decode('latin1')

#text_file = open("../data/chinese_text_windows.txt", "w")
#text_file.write(windows)
#text_file.close()
windows

'å¹¸é\x81\x8bè\x8d\x89ï¼\x8cæ\x98¯å\x9b\x9bè\x91\x89ç\x9a\x84é\x85¢æ¼¿è\x8d\x89å±¬ï¼\x8cä¹\x83æ\x98¯çª\x81è®\x8aç\x9a\x84å\x9b\x9bæ\x9e\x9aå\x80\x92å¿\x83å\x9e\x8bå°\x8fè\x91\x89ã\x80\x82\n\nç\x9b¸å\x82³å¹¸é\x81\x8bè\x8d\x89ç¬¬ä¸\x80ç\x89\x87è\x91\x89å\xad\x90ä»£è¡¨ä¿¡ä»°ï¼\x8cç¬¬äº\x8cç\x89\x87ä»£è¡¨æ\x84\x9bæ\x83\x85ï¼\x8cç¬¬ä¸\x89ç\x89\x87ä»£è¡¨å¸\x8cæ\x9c\x9bã\x80\x82ç\x95¶å¤\x9aå\x87ºä¸\x80ç\x89\x87è\x91\x89å\xad\x90æ\x99\x82ï¼\x8cé\x80\x99ç\x89\x87ä»£è¡¨å¹¸é\x81\x8bã\x80\x82\n\né\x85¢æ¼¿è\x8d\x89æ\x9c\x89é»\x83è\x8a±å\x92\x8cç´«è\x8a±å\x85©ç¨®ï¼\x81ä¸\x80è\x88¬ç¨±ç\x82ºé\x85¢æ¼¿è\x8d\x89ç\x9a\x84ï¼\x8cæ\x98¯æ\x8c\x87é»\x83è\x8a±é\x85¢æ¼¿è\x8d\x89ï¼\x8cé\x96\x8bç´«è\x8a±ç\x9a\x84å\x89\x87ç¨±ä¹\x8bç\x82ºç´«è\x8a±é\x85¢æ¼¿è\x8d\x89ã\x80\x82\n\né\x85¢æ¼¿è\x8d\x89å±¬æ\x96¼é\x85¢æ¼¿è\x8d\x89ç§\x91ï¼\x8cå\x8e\x9fç\x94¢æ\x96¼ç\x86±å¸¶å\x92\x8cäº\x9eç\x86±å¸¶å\x9c°å\x8d\x80ï¼\x8cç\x82ºå¤\x9aå¹´ç\x94\x9fè\x8d\x89æ\x9c¬ç\x9a\x84ç\x90\x83æ\xa0¹æ¤\x8dç\x89©ï¼\x8cä¸\x96ç\x95\x8cå\x90\x84å\x9c°

In [77]:
mac_original = mac.encode('mac-roman').decode('UTF-8')
mac_original

'幸運草，是四葉的酢漿草屬，乃是突變的四枚倒心型小葉。\n\n相傳幸運草第一片葉子代表信仰，第二片代表愛情，第三片代表希望。當多出一片葉子時，這片代表幸運。\n\n酢漿草有黃花和紫花兩種！一般稱為酢漿草的，是指黃花酢漿草，開紫花的則稱之為紫花酢漿草。\n\n酢漿草屬於酢漿草科，原產於熱帶和亞熱帶地區，為多年生草本的球根植物，世界各地總共有將近三百個原生種。是多年生匍匐性草本，有走莖，葉互生，由3片小葉所組成，小葉倒心形，長約0.5～1公分，無柄，先端凹陷，嘗之有酸味。\n\n酢漿草是愛爾蘭的國花，而且童軍也以它做徽章。一般的酢漿草只有三片小葉，偶爾會出現突變的四片小葉個體，稱為「幸運草」，傳說如果有四片小葉的幸運草就能 許願使願望成真，幸運草之所以特別，其實只是一種突變現象，所以幸運草純粹只是突變而來的。偶爾會出現突變的四枚小葉組成的個體，即俗稱的 「幸運草」。\n\n四葉酢醬草一直都被當做幸運的象徵，其實這和有些人有六根手指是一樣的道理，有某個隨機突變使植物長出第四根「手指」，就像遺傳突變使人多長一根手指一樣。\n\n美國農業部證實，產生這種四葉現象的酢醬草其學名是Trifolium repens L.，又稱為白色酢醬草，是一種三葉的多年生草本植物，生長緩慢，但是大約每10,000株當中，會有一株長出四片葉子。\n\n據奧勒岡州立大學植物及植病系教授亞倫李斯頓（Aaron Liston）的說法，太陽的紫外線和肥料中的某些化學物質是造成此類突變的重要原因。而去氧核醣核酸（DNA）發生錯誤也會造成突變，屬於非外力因素。\n\n無論如何，許多國家確實都流傳著四葉幸運酢醬草的傳說，早期威爾斯的塞爾特人相信白色酢醬草可以對抗惡魔。1620年，約翰梅爾頓爵士（Sir John Melton）寫道：如果有人在田間巧遇任何有四片葉子的草，就將會有好運降臨。'

In [70]:
assert(original == mac_original)

The above code checks that the `mac_original` you recovered is equal to the original text. If you did everything right, nothing happens. If the conversion was wrong, an `AssertionError` is raised.

In [89]:
windows_original = windows.encode('latin1').decode('utf-8')
windows_original

'幸運草，是四葉的酢漿草屬，乃是突變的四枚倒心型小葉。\n\n相傳幸運草第一片葉子代表信仰，第二片代表愛情，第三片代表希望。當多出一片葉子時，這片代表幸運。\n\n酢漿草有黃花和紫花兩種！一般稱為酢漿草的，是指黃花酢漿草，開紫花的則稱之為紫花酢漿草。\n\n酢漿草屬於酢漿草科，原產於熱帶和亞熱帶地區，為多年生草本的球根植物，世界各地總共有將近三百個原生種。是多年生匍匐性草本，有走莖，葉互生，由3片小葉所組成，小葉倒心形，長約0.5～1公分，無柄，先端凹陷，嘗之有酸味。\n\n酢漿草是愛爾蘭的國花，而且童軍也以它做徽章。一般的酢漿草只有三片小葉，偶爾會出現突變的四片小葉個體，稱為「幸運草」，傳說如果有四片小葉的幸運草就能 許願使願望成真，幸運草之所以特別，其實只是一種突變現象，所以幸運草純粹只是突變而來的。偶爾會出現突變的四枚小葉組成的個體，即俗稱的 「幸運草」。\n\n四葉酢醬草一直都被當做幸運的象徵，其實這和有些人有六根手指是一樣的道理，有某個隨機突變使植物長出第四根「手指」，就像遺傳突變使人多長一根手指一樣。\n\n美國農業部證實，產生這種四葉現象的酢醬草其學名是Trifolium repens L.，又稱為白色酢醬草，是一種三葉的多年生草本植物，生長緩慢，但是大約每10,000株當中，會有一株長出四片葉子。\n\n據奧勒岡州立大學植物及植病系教授亞倫李斯頓（Aaron Liston）的說法，太陽的紫外線和肥料中的某些化學物質是造成此類突變的重要原因。而去氧核醣核酸（DNA）發生錯誤也會造成突變，屬於非外力因素。\n\n無論如何，許多國家確實都流傳著四葉幸運酢醬草的傳說，早期威爾斯的塞爾特人相信白色酢醬草可以對抗惡魔。1620年，約翰梅爾頓爵士（Sir John Melton）寫道：如果有人在田間巧遇任何有四片葉子的草，就將會有好運降臨。'

In [90]:
assert(original == windows_original)
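
The same round trip can be tried on a file of your own, without the course data. The sketch below writes a short `UTF-8` file to a temporary folder, reads it back with the wrong encoding system to simulate the corruption, and then repairs it. The file name and sample text are illustrative.

```python
import os
import tempfile

text = '幸運草'

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, 'sample.txt')

    # Save the text as UTF-8.
    with open(path, 'w', encoding='utf-8') as f:
        f.write(text)

    # Read it back with the wrong encoding, as a misconfigured program would.
    with open(path, encoding='latin-1') as f:
        garbled = f.read()

# Reverse the wrong decoding, then decode with the right encoding.
repaired = garbled.encode('latin-1').decode('utf-8')

print(garbled == text)    # False
print(repaired == text)   # True
```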