# Files & Character Encoding — Workbook

In this workbook, we're going to explore how to read and write files with the correct character encodings.

> "It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness..."  

> Charles Dickens, *A Tale of Two Cities* (1859)

## Ask for Help

In [None]:
help(open)

## Open a Text File

To read or write a text file with Python, you first need to open it with Python's built-in `open()` function.

Let's open the file "Charles-Dickens-excerpt.txt," which is in our current directory.

In [None]:
open("Charles-Dickens-excerpt.txt", mode='r', encoding='utf-8')

Inside the parentheses of `open()`, you put the filepath of the file you want to open, in quotation marks. You should also specify a mode (`'r'` for reading) and a character encoding. The function returns what's called a *file object*.
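To make the idea of a file object more concrete, here is a minimal sketch that opens a file and inspects the object that comes back. The scratch file `example.txt` is hypothetical and is created in a temporary folder only so that the example runs on its own.

```python
import os
import tempfile

# Hypothetical scratch file, created only so this example is self-contained
scratch_path = os.path.join(tempfile.mkdtemp(), "example.txt")
file_object = open(scratch_path, mode="w", encoding="utf-8")
file_object.write("It was the best of times")
file_object.close()

# Open the file for reading and look at the file object itself
file_object = open(scratch_path, mode="r", encoding="utf-8")
print(type(file_object))      # <class '_io.TextIOWrapper'>
print(file_object.mode)       # r
print(file_object.encoding)   # utf-8

# A file object is a connection to the file, so close it when you are done
file_object.close()
print(file_object.closed)     # True
```

The file object remembers the mode and encoding you asked for, but it has not handed over any text yet.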

## Read a Text File

A file object does not contain readable text yet. To read this file object as text, you need to use the `.read()` method. 

In [None]:
open('Charles-Dickens-excerpt.txt', mode='r', encoding='utf-8').read()

If you want to read in the file line by line, you can use the `.readlines()` method.

In [None]:
open('Charles-Dickens-excerpt.txt', mode='r', encoding='utf-8').readlines()

In [None]:
type(open('Charles-Dickens-excerpt.txt', mode='r', encoding='utf-8').readlines())

The `.readlines()` method will return a list of lines, which we will talk about in a moment.
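The difference between `.read()` and `.readlines()` is easier to see on a tiny file. The sketch below writes two lines to a hypothetical scratch file (not one of the workbook's files) and reads them back both ways.

```python
import os
import tempfile

# Hypothetical two-line scratch file, created only for this example
scratch_path = os.path.join(tempfile.mkdtemp(), "two-lines.txt")
file_object = open(scratch_path, mode="w", encoding="utf-8")
file_object.write("it was the best of times,\nit was the worst of times\n")
file_object.close()

# .read() returns the whole file as one string
file_object = open(scratch_path, mode="r", encoding="utf-8")
whole_text = file_object.read()
file_object.close()

# .readlines() returns a list with one string per line
file_object = open(scratch_path, mode="r", encoding="utf-8")
lines = file_object.readlines()
file_object.close()

print(type(whole_text))   # <class 'str'>
print(type(lines))        # <class 'list'>
print(lines)              # ['it was the best of times,\n', 'it was the worst of times\n']
```

Notice that each item in the list still ends with `\n`.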

## What is `\n`?

In [None]:
print(open('Charles-Dickens-excerpt.txt', mode='r', encoding='utf-8').read())
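`\n` is the newline character. It is a single character that marks the end of a line. When you display a string directly, Python shows it as `\n`; when you `print()` the string, Python turns it into an actual line break. A short sketch, using a made-up two-line string rather than the workbook's file:

```python
# Illustrative string; the \n in the middle is one newline character
two_lines = "it was the best of times,\nit was the worst of times"

# repr() shows the string the way a notebook displays it, with \n visible
print(repr(two_lines))

# print() renders the newline as a real line break
print(two_lines)

# Even though we type two symbols, \n counts as one character
print(len("\n"))                 # 1

# .splitlines() breaks a string apart at its newlines
print(two_lines.splitlines())    # ['it was the best of times,', 'it was the worst of times']
```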

## Assign Text to a Variable

How would we assign this text to a variable?

In [None]:
dickens = open('Charles-Dickens-excerpt.txt', mode='r', encoding='utf-8').read()

In [None]:
dickens

In [None]:
type(dickens)

## Remix!

| **String Method** | **Explanation** |
|:-----------------:|:---------------:|
| `string.replace('old string', 'new string')` | replaces `old string` with `new string` |

You can use the string method `.replace()` to replace text inside a string with different text. Replace the word "times" with a different word of your choice.

In [None]:
dickens.replace('times', 'tacos')

Let's assign this remix to a new variable `remixed_dickens`.

In [None]:
remixed_dickens = dickens.replace('times', 'tacos')
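One detail worth seeing for yourself: `.replace()` does not change the original string. It returns a new string, which is why we save the result in a new variable. A minimal sketch with a short stand-in string (the variable names here are only for illustration):

```python
# Short stand-in for the Dickens excerpt
opening = "it was the best of times, it was the worst of times"

# .replace() returns a new string with every match replaced
remixed = opening.replace("times", "tacos")
print(remixed)    # it was the best of tacos, it was the worst of tacos

# The original string is unchanged
print(opening)    # it was the best of times, it was the worst of times

# An optional third argument limits how many matches are replaced
only_first = opening.replace("times", "tacos", 1)
print(only_first) # it was the best of tacos, it was the worst of times
```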

## Write a Text File

The default mode for the `open()` function is to read text files: `mode='r'`.

But you can use the `open()` function to write files, too. Simply set the mode to write: `mode='w'`. Be careful: if a file with that name already exists, write mode replaces its contents.

Let's create a new text file with our remixed version of the opening line of *A Tale of Two Cities*.

In [None]:
open("Remixed-Dickens.txt", mode='w', encoding='utf-8').write(remixed_dickens)

Double-click on your file in the file browser at the left and see if it worked!
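You can also check your work in code by reading the file back in. The sketch below uses a hypothetical scratch file in a temporary folder. It also shows two things about write mode: `.write()` returns the number of characters written (the number a notebook displays under the cell), and opening an existing file with `mode='w'` erases what was there before.

```python
import os
import tempfile

# Hypothetical scratch file, created only for this example
scratch_path = os.path.join(tempfile.mkdtemp(), "draft.txt")

# .write() returns how many characters it wrote
file_object = open(scratch_path, mode="w", encoding="utf-8")
characters_written = file_object.write("first draft")
file_object.close()
print(characters_written)   # 11

# Opening the same file in 'w' mode again starts from an empty file
file_object = open(scratch_path, mode="w", encoding="utf-8")
file_object.write("second draft")
file_object.close()

# Read the file back to confirm what it contains now
file_object = open(scratch_path, mode="r", encoding="utf-8")
contents = file_object.read()
file_object.close()
print(contents)             # second draft
```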

## Character Encodings

Why do we need to include `encoding='utf-8'` to open our text file?

UTF-8 is a character encoding for [Unicode](https://home.unicode.org/basic-info/faq/), the standard that assigns a number to each character. We need to specify a character encoding because (*gasp!*) computers don't actually know what text is. They store only numbers, and character encodings are systems that map characters to numbers.

You can check any character's "code point," or place in the Unicode universe, with the function `ord()`.

In [None]:
ord("a")

In [None]:
ord("💩")

In [None]:
ord("ত")

In [None]:
ord("!")
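The code point is only half of the story. A code point is a character's number in Unicode, while UTF-8 decides which bytes are used to store that number in a file. As a rough sketch, the loop below shows each character's code point, uses `chr()` to go from the number back to the character, and uses `.encode()` to count how many bytes UTF-8 needs for it.

```python
# The same characters we checked with ord() above
characters = ["a", "!", "ত", "💩"]

for character in characters:
    code_point = ord(character)               # character -> number
    round_trip = chr(code_point)              # number -> character
    utf8_bytes = character.encode("utf-8")    # character -> bytes stored on disk
    print(character, code_point, round_trip, len(utf8_bytes), "byte(s)")
```

Characters from the basic English alphabet take one byte each in UTF-8, while the Bengali letter takes three and the emoji takes four.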

Unicode is the most widely used character set on the internet, and UTF-8 is its most popular encoding. It even includes emojis. Yet, as Aditya Mukerjee argues in his essay "[I Can Text You A Pile of Poo, But I Can’t Write My Name](https://modelviewculture.com/pieces/i-can-text-you-a-pile-of-poo-but-i-cant-write-my-name)", Unicode still does not include characters that are essential to Bengali as well as to many other non-English languages.

## Adding (UTF-8) Encoding

It's always good practice to explicitly specify UTF-8 encoding when reading and writing files. Let's open and read "sample-character-encoding.txt".

In [None]:
sample_text_default = open('sample-character-encoding.txt', encoding='utf-8').read()
print(sample_text_default)

Look what happens if we read in the exact same text with a different encoding.

In [None]:
sample_text_iso = open('sample-character-encoding.txt', encoding='iso-8859-1').read()
print(sample_text_iso)
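Why does the text come out garbled? The bytes in the file have not changed; only the rule for turning bytes into characters has. Here is a small sketch of the same effect without any file, using a made-up example string. We encode the string to UTF-8 bytes and then decode those bytes two different ways.

```python
# Made-up example text with two accented characters
original_text = "café naïve"

# Encoding turns the string into the bytes that would be saved in a file
utf8_bytes = original_text.encode("utf-8")

# Decoding with the matching encoding gives the text back
read_correctly = utf8_bytes.decode("utf-8")
print(read_correctly)   # café naïve

# Decoding with ISO-8859-1 treats every byte as its own character,
# so each accented letter turns into two wrong characters
misread = utf8_bytes.decode("iso-8859-1")
print(misread)          # cafÃ© naÃ¯ve
```

ISO-8859-1 never raises an error here, because every possible byte is a valid character in that encoding. The result is silently wrong, which is why it pays to know the right encoding.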

## Real Life Encoding Example — "The Lady with a Dog"

I wanted to find some Russian fiction that we could analyze together as a class, so I reached out to my friend and colleague Quinn Dombrowski, a DH researcher at Stanford who specializes in non-English DH.

She directed me to [Lib.ru](http://lib.ru/), a website that hosts a lot of Russian-language texts, where I was able to find Anton Chekhov's short story, ["The Lady with a Dog"](http://lib.ru/LITRA/CHEHOW/d.txt) (1899), in the original Russian.

On the [web page](http://lib.ru/LITRA/CHEHOW/d.txt) for "Lady with a Dog," I selected the option in the right-hand corner to download the "txt" file. But when I opened it up on my computer, I realized that it had an unfamiliar character encoding. It wouldn't open with UTF-8.

In [None]:
print(open('../texts/literature/Lady-With-a-Dog_Chekov-KOI8R.txt', encoding='UTF-8').read())

So I used the Python package [chardet](https://github.com/chardet/chardet) (explained in the "Ω-Convert-to-UTF8" notebook) to identify that it was not UTF-8 but [KOI8-R](https://en.wikipedia.org/wiki/KOI8-R), a character encoding designed for Russian in the 1990s.

In [None]:
print(open('../texts/literature/Lady-With-a-Dog_Chekov-KOI8R.txt', encoding='KOI8-R').read())
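If you don't have the Chekhov file handy, you can reproduce the same behavior with a short string. This is only a sketch: it uses the story's Russian title as stand-in text and encodes it to KOI8-R ourselves, instead of reading the downloaded file.

```python
# Stand-in text: the Russian title of "The Lady with a Dog"
russian_title = "Дама с собачкой"

# Pretend these are the bytes of a file saved in KOI8-R
koi8r_bytes = russian_title.encode("koi8-r")

# Decoding KOI8-R bytes as UTF-8 fails with a UnicodeDecodeError
utf8_failed = False
try:
    koi8r_bytes.decode("utf-8")
except UnicodeDecodeError as error:
    utf8_failed = True
    print("UTF-8 could not decode these bytes:", error.reason)

# Decoding with the encoding that was actually used works
decoded_title = koi8r_bytes.decode("koi8-r")
print(decoded_title)    # Дама с собачкой
```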

So with the right encoding, we can count the most frequent words in the Russian-language version of "Lady with a Dog"! But what do you notice as potential issues when running this script on a non-English language?

In [None]:
import re
from collections import Counter

def split_into_words(any_chunk_of_text):
    lowercase_text = any_chunk_of_text.lower()
    split_words = re.split(r"\W+", lowercase_text)
    return split_words

filepath_of_text = '../texts/literature/Lady-With-a-Dog_Chekov-KOI8R.txt'
number_of_desired_words = 40

stopwords = ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours',
 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', 'her', 'hers',
 'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves',
 'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those', 'am', 'is', 'are',
 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does',
 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until',
 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into',
 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down',
 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here',
 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more',
 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so',
 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', 'should', 'now', 've', 'll', 'amp']

full_text = open(filepath_of_text, encoding='KOI8-R').read()

all_the_words = split_into_words(full_text)
meaningful_words = [word for word in all_the_words if word not in stopwords]
meaningful_words_tally = Counter(meaningful_words)
most_frequent_meaningful_words = meaningful_words_tally.most_common(number_of_desired_words)

most_frequent_meaningful_words

## Advanced

## Read and Write Files Using a `with` Statement

There is another way that you can open files in Python, which is to use something called a `with` statement. You'll probably see this method if you look at code on the internet. The `with` statement ensures that your file is closed properly once you are done with it, even if an error occurs along the way. It matters more in larger programs, but it's good to know about.

This method is very similar to the first method, except you put the word `with` before `open()`, and then you give the file object a nickname. We will call it `read_file_object`, but you can actually call it whatever you want (often people just use `f` but we want to be more descriptive!). This nickname `as read_file_object` is followed by a colon `:`. Then, on the indented lines that follow, you read the file as normal.

In [None]:
with open('Charles-Dickens-excerpt.txt', encoding='utf-8') as read_file_object:
    text = read_file_object.read()

In [None]:
print(text)

You use the same structure for writing files, but you might give the file object a different nickname, such as `write_file_object`.

In [None]:
with open('text-file-made-by-with.txt', mode='w', encoding='utf-8') as write_file_object:
    write_file_object.write('I made this file with a with statement!')
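To see what the `with` statement does for you, you can check a file object's `.closed` attribute inside and outside the indented block. This is a minimal sketch that writes to a hypothetical scratch file in a temporary folder.

```python
import os
import tempfile

# Hypothetical scratch file, created only for this example
scratch_path = os.path.join(tempfile.mkdtemp(), "with-example.txt")

with open(scratch_path, mode="w", encoding="utf-8") as write_file_object:
    write_file_object.write("I made this file with a with statement!")

# Once the indented block ends, the file has been closed for us
print(write_file_object.closed)       # True

with open(scratch_path, encoding="utf-8") as read_file_object:
    text = read_file_object.read()
    # Inside the block, the file is still open
    print(read_file_object.closed)    # False

print(read_file_object.closed)        # True
print(text)                           # I made this file with a with statement!
```

The variable `text` is still available after the block ends; only the connection to the file has been closed.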