# Files and Character Encoding

## Open a Text File

The most basic way to open a file is to call Python's built-in `open()` function with the file path, surrounded by quotation marks, as its argument. That creates a file object. We then tack on the `.read()` method to read the file's contents into one big string. As we will see below, it's good practice to add one more step here to handle character encoding.

In [3]:
open('sample-character-encoding.txt').read()

'***\nThis is an example of curly quotation marks:\n“She said, ‘I won’t bungle the encoding!’”\n***\n\n***\nThis is an example of an emoji:\n💩\n***\n\n***\nThis is an example of Bengali:\nআদিত্য মুখোপাধ্যায় টাইপ করতে পারেন - তবে নিজের নাম বানান করতে পারবেন না\n(Aditya Mukerjee can type 💩 but cannot spell his own name)\n***\n\n***\nThis is an example of German:\nWas ist, wenn wir über deutsche Sprachen recherchieren wollen?\n(What if we want to research German languages?)\n***'
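
A common variant of this pattern wraps `open()` in a `with` statement so that the file is closed automatically once it has been read. The sketch below is only an illustration. The file name `demo.txt` and the temporary directory are hypothetical stand-ins so the snippet can run on its own, and the `encoding` argument is explained in the next section.

```python
import tempfile
from pathlib import Path

# Hypothetical demo file, created in a temporary directory so the sketch is self-contained
demo_path = Path(tempfile.mkdtemp()) / "demo.txt"
demo_path.write_text("“She said, ‘I won’t bungle the encoding!’”\n", encoding="utf-8")

# open() returns a file object; .read() turns its contents into one big string
with open(demo_path, encoding="utf-8") as file_object:
    demo_text = file_object.read()

print(type(demo_text))
print(demo_text)
```

The `with` block does the same work as `open(...).read()`, but it also closes the file when the block ends.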

## Character Encoding

> Written text is a sequence of graphemes – characters. Every character you type – whether letters like ‘a’ and ‘b’, punctuation marks like ‘?’, or even emoji like 💪👬👍 – has an ID number that computers use to store it. In order to communicate, computers need to agree on a common roster of how to assign these graphemes to numbers and vice versa. These rosters are known as character encodings.

> -Aditya Mukerjee, "[I Can Text You A Pile of Poo, But I Can’t Write My Name](https://modelviewculture.com/pieces/i-can-text-you-a-pile-of-poo-but-i-cant-write-my-name)"

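Python 3 represents text as Unicode, and UTF-8 is the most common way of encoding Unicode text as bytes. UTF-8 is the most popular character encoding on the internet, and Unicode even includes emojis. When you open a file without naming an encoding, Python falls back on your system's default, which is usually UTF-8 on macOS and Linux but has often been something else on Windows.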
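
To make the idea of ID numbers concrete, the sketch below uses Python's built-in `ord()`, `chr()`, and `str.encode()` to show a character's Unicode number alongside the bytes UTF-8 uses to store it. The four sample characters and the variable names are chosen only for illustration.

```python
byte_lengths = {}

# ord() looks up a character's Unicode ID number (its "code point");
# encode("utf-8") shows the bytes UTF-8 uses to store that character
for character in ["a", "ü", "’", "💩"]:
    code_point = ord(character)
    utf8_bytes = character.encode("utf-8")
    byte_lengths[character] = len(utf8_bytes)
    print(character, code_point, utf8_bytes, len(utf8_bytes))

# chr() goes the other way, from ID number back to character
print(chr(97))  # a
```

A plain English letter fits in a single byte, while accented letters, curly quotation marks, and emoji need two, three, or four bytes. Those multi-byte characters are the ones that get garbled when a file is read with the wrong encoding.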

However, as Mukerjee argues in his essay, Unicode's support for Bengali and for many other non-English languages has lagged behind its support for English and even for emoji.

In [64]:
sample_text_default = open('sample-character-encoding.txt').read()

In [65]:
print(sample_text_default)

***
This is an example of curly quotation marks:
“She said, ‘I won’t bungle the encoding!’”
***

***
This is an example of an emoji:
💩
***

***
This is an example of Bengali:
আদিত্য মুখোপাধ্যায় টাইপ করতে পারেন - তবে নিজের নাম বানান করতে পারবেন না
(Aditya Mukerjee can type 💩 but cannot spell his own name)
***

***
This is an example of German:
Was ist, wenn wir über deutsche Sprachen recherchieren wollen?
(What if we want to research German languages?)
***


Because the encoding Python uses by default depends on your operating system and settings, it's good practice to explicitly declare UTF-8 encoding, as below.

In [66]:
sample_text_utf8 = open('sample-character-encoding.txt', encoding='utf-8').read()

In [67]:
print(sample_text_utf8)

***
This is an example of curly quotation marks:
“She said, ‘I won’t bungle the encoding!’”
***

***
This is an example of an emoji:
💩
***

***
This is an example of Bengali:
আদিত্য মুখোপাধ্যায় টাইপ করতে পারেন - তবে নিজের নাম বানান করতে পারবেন না
(Aditya Mukerjee can type 💩 but cannot spell his own name)
***

***
This is an example of German:
Was ist, wenn wir über deutsche Sprachen recherchieren wollen?
(What if we want to research German languages?)
***
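
If you are curious which encoding `open()` would fall back on when no `encoding` argument is given, one way to check is the standard library's `locale.getpreferredencoding()`. This is a rough sketch. The answer depends on your operating system, settings, and Python version, so no expected output is shown.

```python
import locale

# The locale's preferred encoding, which open() generally uses
# when called without an explicit encoding= argument
default_encoding = locale.getpreferredencoding(False)
print(default_encoding)
```

If this prints something other than a UTF-8 variant, then reading a UTF-8 file without `encoding='utf-8'` may garble it or raise an error on your machine.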


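Look what happens if we read in the exact same text with a different encoding. ISO-8859-1 treats every byte as a separate character, so any character that UTF-8 stores as several bytes turns into a string of nonsense symbols.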

In [69]:
sample_text_iso = open('sample-character-encoding.txt', encoding='iso-8859-1').read()
print(sample_text_iso)

***
This is an example of curly quotation marks:
âShe said, âI wonât bungle the encoding!ââ
***

***
This is an example of an emoji:
ð©
***

***
This is an example of Bengali:
à¦à¦¦à¦¿à¦¤à§à¦¯ à¦®à§à¦à§à¦ªà¦¾à¦§à§à¦¯à¦¾à¦¯à¦¼ à¦à¦¾à¦à¦ª à¦à¦°à¦¤à§ à¦ªà¦¾à¦°à§à¦¨ - à¦¤à¦¬à§ à¦¨à¦¿à¦à§à¦° à¦¨à¦¾à¦® à¦¬à¦¾à¦¨à¦¾à¦¨ à¦à¦°à¦¤à§ à¦ªà¦¾à¦°à¦¬à§à¦¨ à¦¨à¦¾
(Aditya Mukerjee can type ð© but cannot spell his own name)
***

***
This is an example of German:
Was ist, wenn wir Ã¼ber deutsche Sprachen recherchieren wollen?
(What if we want to research German languages?)
***
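
The garbling in the German example can be reproduced in miniature. The sketch below takes the single word "über", encodes it as UTF-8 bytes, and then decodes those bytes as ISO-8859-1, which is in effect what happened to the whole file above. The variable names are illustrative only.

```python
original = "über"

# UTF-8 stores "ü" as two bytes
utf8_bytes = original.encode("utf-8")
print(utf8_bytes)  # b'\xc3\xbcber'

# ISO-8859-1 reads each byte as its own character, so "ü" becomes two characters
garbled = utf8_bytes.decode("iso-8859-1")
print(garbled)  # Ã¼ber

# Undoing the mistake recovers the original text
repaired = garbled.encode("iso-8859-1").decode("utf-8")
print(repaired)  # über
```

Because the bytes themselves never changed, nothing was lost. Reading them with the right encoding brings the text back.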


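ASCII is more limited still. It covers only 128 characters, so instead of producing garbled text, Python raises an error at the first byte that ASCII cannot interpret.

In [70]: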
sample_text_ascii = open('sample-character-encoding.txt', encoding='ascii').read()
print(sample_text_ascii)

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 49: ordinal not in range(128)
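
The same error can be triggered on a much smaller scale, which makes it easier to see what the message is reporting. The sketch below works on a short byte string rather than on the sample file, and its variable names are illustrative. It also shows the `errors="replace"` option, which `open()` accepts as well through its `errors` parameter.

```python
utf8_bytes = "won’t".encode("utf-8")
error_position = None

try:
    utf8_bytes.decode("ascii")
except UnicodeDecodeError as error:
    # error.start is the position of the first byte ASCII cannot interpret
    error_position = error.start
    print(error_position, hex(utf8_bytes[error_position]))  # 3 0xe2

# errors="replace" swaps each undecodable byte for the replacement character U+FFFD
lossy_text = utf8_bytes.decode("ascii", errors="replace")
print(lossy_text)
```

Replacing bytes this way silences the error but throws information away: the curly apostrophe is gone for good. Choosing the correct encoding is almost always the better fix.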

> If you open a document and it looks like this, there's one and only one reason for it: Your text editor, browser, word processor or whatever else that's trying to read the document is assuming the wrong encoding. That's all. The document is not broken (well, unless it is, see below), there's no magic you need to perform, you simply need to select the right encoding to display the document.

> -[What Every Programmer Absolutely, Positively Needs to Know About Encodings and Character Sets to Work With Text](http://kunststube.net/encoding/)