# Lecture 5.2: Reading and Writing Files
## 1. Opening, Reading and Writing Files

In the previous lectures we covered the Python basics and inspected some of the built-in data types and syntax. The real power of programming lies in the ability of computers to perform thousands of operations in very little time. In other words, coding helps us to interrogate "big data", which for the humanities often means more text than we are able to read in a lifetime.

Until now, our "data" was mostly confined to mock examples: strings, lists or other values we entered manually. In this lecture we finally turn to a more realistic use of coding in the Humanities, and show how Python assists us with analysing larger, external information sources.

In this series of lectures we focus on:
- Reading files from disk.
- Reading data from the web.
- Performing some analysis on these data.
- Writing results to a file.

## 1.1. Locating files: `os` and `path`

Before opening a file, Python has to locate it. Create a string variable that tells your program where it has to look.
By default, Python looks in the current working directory, which is normally the folder where your script (or Notebook such as this one) is located. You can therefore use a **relative path**, i.e. a path that starts from where your script is located.
Let's try to find John Locke's "An Essay Concerning Human Understanding", which is saved in the `data` subdirectory.

In [None]:
file_name = 'data/pg10615.txt'

In the above cell we assigned the relative path to the `file_name` variable. Using relative paths is highly recommended. Not only are they often shorter, they also make your scripts and data more portable. The relative path will work on any computer (as long as you don't start moving the folders around), while the absolute path only points to the right file on my laptop.

relative path: `'data/pg10615.txt'`

absolute path: `'/Users/kasparbeelen/Documents/Onderwijsea/CTH/lectures/lecture3/data/pg10615.txt'`
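
The standard-library `os.path` module can convert between the two kinds of path. The following is a minimal sketch, reusing the Locke file name from above; the file does not have to exist for `join` and `abspath` to work, and the absolute path you get will of course differ from the one on my laptop.

```python
import os

# Build the relative path in a way that works on any operating system.
relative_path = os.path.join('data', 'pg10615.txt')

# Turn it into an absolute path, starting from the current working directory.
absolute_path = os.path.abspath(relative_path)

print(relative_path)
print(absolute_path)

# Check whether the file is really there before trying to open it.
print(os.path.exists(relative_path))
```

Building paths with `os.path.join` instead of typing the slashes yourself keeps the script usable on Windows, where the separator is a backslash.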


[VU] Sometimes you see double dots at the beginning of a file path; this means 'the parent of the current directory'. When writing a file path, you can use the following:

- /     go to the root of the current drive
- ./    go to the current directory
- ~/    go to the home directory
- -     go to the previous directory (only works with `cd`)
- ../   go to the parent directory (one up in the tree)
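
These symbols can also be resolved from within Python. The sketch below uses `os.path.normpath` and `os.path.expanduser`; the folder name `lecture2` is only a hypothetical example and does not have to exist.

```python
import os

# '..' steps one level up, so 'lecture3' followed by '..' cancels out.
path = os.path.join('lectures', 'lecture3', '..', 'lecture2')
clean_path = os.path.normpath(path)
print(clean_path)

# '~' is shorthand for your home directory.
home = os.path.expanduser('~')
print(home)
```

Note that Python's `open()` does not expand `~` by itself, so call `os.path.expanduser` first if a path starts with it.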

You can use your Notebook to navigate your computer as you would in your terminal. `pwd` (print working directory), `cd` (change directory) and `ls` (list directory) are the commands you need for this.
The cells below walk through an example, including a trip to your home directory.

Print the current directory:

In [None]:
pwd

Go to the User directory:

In [None]:
cd ~/

List all items in the User directory:

In [None]:
ls ./ 

Go back to the previous directory:

In [None]:
cd -

Go to the parent's parent folder (go two up):

In [None]:
cd ../../

In [None]:
ls ./

Go back to the previous directory:

In [None]:
cd -

Go one up:

In [None]:
cd ..

Go one level down, to the 'lecture3' folder:

In [None]:
cd lecture3

...and we should be home again after a long travel:

In [None]:
pwd

The code you are running in the above cells is not Python, but the kind of command you would type in bash, the command line language. Jupyter Notebook allows you to combine both languages to some extent.
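
If you prefer to stay in Python, the `os` module offers rough equivalents of these commands. A minimal sketch that goes one directory up, lists what is there, and returns to where it started:

```python
import os

start = os.getcwd()            # like pwd
print(start)

os.chdir('..')                 # like cd ..
print(os.listdir('.'))         # like ls ./

os.chdir(start)                # go back to where we started
print(os.getcwd())
```

Unlike the commands above, this version also works in a plain Python script outside a Notebook.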

## 1.2 Opening Documents

Python has a built-in function, `open()`, which returns a 'file object'.

`open()` has the following crucial arguments:
- **location**: the path of the file (see above)
- **mode**: a combination of characters that indicates the purpose of opening the file
- **encoding**: the encoding of the text file

What do **mode** and **encoding** actually mean?

### 1.2.1 Encoding 

**UTF-8**

You may wonder what an encoding is and what *utf-8* is. For anyone working with texts and computers this is vital to know. Internally, a computer knows no characters whatsoever: every piece of information is represented as numbers (which in turn are represented in a binary format, as zeroes and ones). An encoding specifies which numbers represent which characters. A famous and long-standing encoding scheme is ASCII, in which for example the letter 'A' is encoded using the number 65. ASCII, however, only has a very limited alphabet and cannot encode most writing systems. The modern standard that covers countless writing systems is *Unicode*, and *utf-8* is the most widely used way of encoding Unicode text. This is the type of encoding that you will want to use for your data. Whenever you have a choice, you should use utf-8!
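
To make this concrete, here is a small sketch of the link between characters, numbers and bytes. The word `café` is just an illustrative example.

```python
# Every character corresponds to a number.
print(ord('A'))   # 65
print(chr(65))    # A

# Encoding turns a string into bytes; 'é' needs two bytes in utf-8.
word = 'café'
utf8_bytes = word.encode('utf-8')
print(utf8_bytes)  # b'caf\xc3\xa9'

# Decoding turns the bytes back into a string.
print(utf8_bytes.decode('utf-8'))

# ASCII has no number for 'é', so encoding fails.
try:
    word.encode('ascii')
    ascii_works = True
except UnicodeEncodeError as error:
    ascii_works = False
    print('ASCII cannot encode this word:', error)
```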

### 1.2.2 Mode
[VU]
* **r** = Opens a file for reading only. The file pointer is placed at the beginning of the file.
* **w** = Opens a file for writing only. Overwrites the file if the file exists. If the file does not exist, creates a new file for writing.
* **a** = Opens a file for appending. The file pointer is at the end of the file if the file exists. If the file does not exist, it creates a new file for writing. Use it if you would like to add something to the end of a file.
* **t** = Text mode. This is the default, and it can be combined with the other modes (for example `rt`).
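
The difference between the modes is easiest to see by trying them on a throwaway file. The sketch below assumes a hypothetical file `modes_demo.txt` in a temporary directory, so nothing in your `data` folder is touched.

```python
import os
import tempfile

# Hypothetical scratch file, created in a temporary directory.
path = os.path.join(tempfile.mkdtemp(), 'modes_demo.txt')

f = open(path, 'w', encoding='utf-8')   # 'w' creates the file (or empties it)
f.write('first line\n')
f.close()

f = open(path, 'a', encoding='utf-8')   # 'a' adds to the end of the file
f.write('second line\n')
f.close()

f = open(path, 'r', encoding='utf-8')   # 'r' only reads
after_append = f.read()
f.close()

f = open(path, 'w', encoding='utf-8')   # 'w' again: the old content is lost
f.write('fresh start\n')
f.close()

f = open(path, 'r', encoding='utf-8')
after_overwrite = f.read()
f.close()

print(after_append)
print(after_overwrite)
```

Be careful with `w`: it empties an existing file as soon as it is opened, even before anything is written.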

## 1.3 Reading Documents

Let's use Python to read in a few paragraphs from Locke's "An Essay Concerning Human Understanding".

In [None]:
locke = open('data/locke_excerpt.txt','r')

Reminder: The `open()` function requires the file path as its first argument. The second (optional) argument specifies the *mode* in which the file is opened. The encoding of the file is best passed by name, for example `encoding='utf-8'`.

Even though we 'opened' the file in 'read' mode, this function does not return the actual content or text. To assign the text to a variable we have to call the `read()` method on the file object:

In [None]:
locke_text = locke.read()
print(locke_text)

After reading, it is recommended to close the file:

In [None]:
locke.close()

The code below will raise a `ValueError`, because the content is no longer accessible after closing the file:

In [None]:
locke.read()
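
If you are unsure whether a file is still open, you can inspect its `closed` attribute. The sketch below reproduces the error on a hypothetical scratch file and catches it, so that the cell does not stop with a traceback.

```python
import os
import tempfile

# Hypothetical scratch file, created in a temporary directory.
path = os.path.join(tempfile.mkdtemp(), 'closed_demo.txt')
demo = open(path, 'w', encoding='utf-8')
demo.write('some text')
demo.close()

demo = open(path, 'r', encoding='utf-8')
demo.close()
print(demo.closed)   # True: the file object still exists, but it is closed

try:
    demo.read()
    message = 'no error'
except ValueError as error:
    message = str(error)

print(message)
```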

## 1.4 `read()`, `readlines()` and `readline()`

In order to *read* the contents of the file, Python provides three related operations. The first operation is `read()`:

`f = open(path,'r').read()` assigns the entire document to a variable `f`:

In [None]:
document = open('data/locke_excerpt.txt','r')
text = document.read()
document.close()

The variable `text` now holds the entire content of the file located at `data/locke_excerpt.txt` as a single string and we can access and manipulate it just like any other string. We can print the first 100 characters of this string:

In [None]:
print(text[:100])

The second operation is `readlines()`, which returns a list of the lines in the file, where each item of the list represents a single line:

In [None]:
document = open('data/locke_excerpt.txt','r')
lines = document.readlines()
print(lines)
print(type(lines))
document.close()

The third operation `readline()` returns the next line of the file, returning the text up to and including the next newline character (*\n*, or *\r\n* on Windows). More simply put, this operation will read a file line-by-line. So if you call this operation again, it will return the next line in the file. Try it out below!

In [None]:
infile = open('data/locke_excerpt.txt', "r")
next_line = infile.readline()
print(next_line)

Run the cell below repeatedly with `ctrl+enter`; it shows you a new line each time.

In [None]:
print(infile.readline())
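
What happens when there are no lines left? A small sketch on a hypothetical two-line scratch file shows that `readline()` then returns an empty string rather than raising an error:

```python
import os
import tempfile

# Hypothetical scratch file with exactly two lines.
path = os.path.join(tempfile.mkdtemp(), 'two_lines.txt')
outfile = open(path, 'w', encoding='utf-8')
outfile.write('line one\nline two\n')
outfile.close()

infile = open(path, 'r', encoding='utf-8')
first = infile.readline()
second = infile.readline()
third = infile.readline()    # nothing left to read
infile.close()

# repr() makes the newline characters visible.
print(repr(first))
print(repr(second))
print(repr(third))
```

An empty string therefore signals the end of the file, whereas an empty line in the middle of a file is returned as `'\n'`.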

But what about **big data**? So far, we have loaded the complete file. But what if the file size ran into the gigabytes, and we were only interested in a small subsection of the data? Loading the entire file into memory would significantly slow down your computer (unless you possess one with generous RAM, but even then). Instead, we can loop over the file object itself, which hands us one line at a time:

In [None]:
infile = open('data/locke_excerpt.txt', "rt")
for line in infile:
    print(line)
infile.close()

The last line, `infile.close()`, closes our file, which is a very important operation. It prevents Python from keeping files open that are no longer needed.
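
Because a loop over a file hands over one line at a time, you can stop as soon as you have found what you need. The sketch below assumes a hypothetical scratch file of a thousand numbered lines and stops at the third one.

```python
import os
import tempfile

# Hypothetical scratch file with a thousand numbered lines.
path = os.path.join(tempfile.mkdtemp(), 'many_lines.txt')
outfile = open(path, 'w', encoding='utf-8')
for number in range(1, 1001):
    outfile.write('line {}\n'.format(number))
outfile.close()

lines_read = 0
found = None
infile = open(path, 'r', encoding='utf-8')
for line in infile:
    lines_read += 1
    if line.strip() == 'line 3':
        found = line.strip()
        break                # stop looping; the rest is never processed
infile.close()

print(found, lines_read)
```

Compare this with `read()` or `readlines()`, which would first load all thousand lines into memory.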

### Intermezzo: The 'newline' character

[MK]The 'newline' character is probably something new to you. If you are dealing with plain text files (typically files whose name ends in the '.txt' extension), your machine uses a special character internally to signal that a new line should begin. Internally, such newlines are represented as `"\n"`. Normally, this character is visualized on your screen as if the enter key were pressed. See what happens below: 

In [None]:
s = "This is the first line.\nThis is the second line."
print(s)

There exists a similar character to encode 'tab' characters, namely `\t`. You can use this character to play around with the indentation of your (e.g. hierarchically structured) output:


In [None]:
s = "First line\n\t* Second line\n\t* Third line\n\t* Fourth line\nFifth line"
print(s)

[MK] In the code block above in which you read the Locke file line by line, the newline is still included with the line that preceded it in the file: this is why you see all the extra empty lines in the output! If you wish to remove all leading and trailing whitespace in a string (newlines, spaces, but also tabs), you can use the `strip()` method:

In [None]:
s = "   strip me!    "
print(s)
print(s.strip())
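
Sometimes you only want to remove whitespace on one side, or only the newline. A short sketch of the related methods `rstrip()` and `lstrip()`, using `repr()` to make the whitespace visible:

```python
line = '\t  An Essay Concerning Human Understanding  \n'

both_sides = line.strip()          # removes whitespace on both sides
only_newline = line.rstrip('\n')   # removes only the trailing newline
left_side = line.lstrip()          # removes only the leading whitespace

print(repr(both_sides))
print(repr(only_newline))
print(repr(left_side))
```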

*Exercise*: loop through the file and print each line without the leading and trailing whitespace.

#### End of intermezzo

## 1.5 Processing Files

Besides printing we can also manipulate the content of the file or extract information from it such as counting the number of lines. 

In [None]:
infile = open('data/pg10615.txt', "rt")
count = 0
for line in infile:
    count+=1
    
print(count)
infile.close()

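We can also ignore all lines with fewer than ten characters.

In [None]:
infile = open('data/locke_excerpt.txt', "rt")
count = 0
new_lines = []
for line in infile:
    # keep only lines of at least ten characters
    # (choice made here: surrounding whitespace is not counted)
    if len(line.strip()) >= 10:
        new_lines.append(line)
        count += 1

print(new_lines)
print(count)
infile.close()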

This is just a small teaser. During the next lectures we 

## 1.6 Context Manager

In many situations you have to read in and process a large collection of texts. Keeping all these files stored in memory is often pointless and might slow down your computer. [VU] In fact, it is good practice to close the file as soon as you do not need it anymore. Now, lo and behold, we can achieve that with the following:

In [None]:
infile.close()

There is actually an easier (and preferred) way to make sure that the file is closed as soon as you don't need it anymore, namely using what is called a **context manager**:

In [None]:
filename = 'data/locke_excerpt.txt'
with open(filename, "r") as infile:
    content = infile.read()
    
print(content)

[VU] The main advantage of using the with-statement is that it automatically closes the file once you leave the local context defined by the indentation level. If you 'manually' open and close the file, you risk forgetting to close the file. Therefore, context managers are considered best practice, and we will use the with-statement in all of our following code.
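
You can check this behaviour yourself through the `closed` attribute. The sketch below uses a hypothetical scratch file and also shows that the file gets closed when an error interrupts the block; the `RuntimeError` is raised on purpose for the demonstration.

```python
import os
import tempfile

# Hypothetical scratch file, created in a temporary directory.
path = os.path.join(tempfile.mkdtemp(), 'with_demo.txt')

with open(path, 'w', encoding='utf-8') as outfile:
    outfile.write('first line\nsecond line\n')

print(outfile.closed)   # True: closed as soon as the block ends

try:
    with open(path, 'r', encoding='utf-8') as infile:
        first_line = infile.readline()
        # Simulate a problem halfway through processing.
        raise RuntimeError('something went wrong while processing')
except RuntimeError as error:
    print('Caught:', error)

print(infile.closed)    # True: closed even though an error occurred
```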

## 1.7 Writing Files

### 1.7.1 Writing data to a file

In [None]:
# 'w' creates the file, or overwrites it if it already exists
with open('data/test_file.txt', 'w') as doc:
    doc.write('Test\nthis is only a\nTEST!')
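
Note that `write()` does not add a newline for you. When writing a list of strings, you have to append `\n` yourself, as in this sketch (the scratch file `lines_demo.txt` is a hypothetical name):

```python
import os
import tempfile

# Hypothetical scratch file, created in a temporary directory.
path = os.path.join(tempfile.mkdtemp(), 'lines_demo.txt')

sentences = ['Test', 'this is only a', 'TEST!']

with open(path, 'w', encoding='utf-8') as outfile:
    for sentence in sentences:
        outfile.write(sentence + '\n')   # add the newline explicitly

# Read the file back to check what ended up on disk.
with open(path, 'r', encoding='utf-8') as infile:
    written = infile.read()

print(written)
```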

### 1.7.2 Writing CSV Files

The previous steps reduced a book to a table of word frequencies. You do not want to repeat this procedure every time, so it is better to save the table as an intermediate result. A suitable format is a CSV file, where CSV stands for Comma Separated Values. The comma in this case is called the **delimiter**: the character that separates the items on each row. The end of a row is usually marked by a hard return (a newline).

The content of an example CSV file:

```
ideas,1398
one,911
idea,886
```



In [None]:
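from collections import Counter
# word_tokenize may require downloading NLTK's tokenizer data first
from nltk.tokenize import word_tokenize

# read the text, lowercase it, split it into tokens and count them
with open('data/locke_excerpt.txt', 'r') as infile:
    text = infile.read().lower()
wf = Counter(word_tokenize(text))
print(wf)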

In [None]:
content = ''
for key,value in wf.items():
    line = key+','+str(value)+'\n'
    content+=line
    
# or more concise
#content = '\n'.join(["{},{}".format(k,v) for k,v in wf.items()])

In [None]:
filename = "data/wf.csv"
with open(filename, "w") as outfile:
    outfile.write(content)
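
Joining the values with commas by hand breaks down when a token contains the delimiter itself, and a tokenizer will typically return `,` as a token of its own. Python's standard-library `csv` module takes care of the necessary quoting. A minimal sketch, assuming a small hand-made counter: the first three counts come from the example above, the count for the comma is made up.

```python
import csv
import os
import tempfile
from collections import Counter

# Counts from the example above, plus a made-up count for the comma token.
wf = Counter({'ideas': 1398, 'one': 911, 'idea': 886, ',': 5000})

# Hypothetical scratch file, created in a temporary directory.
path = os.path.join(tempfile.mkdtemp(), 'wf_demo.csv')

# newline='' lets the csv module handle line endings itself.
with open(path, 'w', encoding='utf-8', newline='') as outfile:
    writer = csv.writer(outfile)
    for word, frequency in wf.most_common():
        writer.writerow([word, frequency])

# Read the table back; every value comes back as a string.
with open(path, 'r', encoding='utf-8', newline='') as infile:
    rows = [(word, int(frequency)) for word, frequency in csv.reader(infile)]

print(rows)
```

The comma token is written as `","` in the file and read back as a single value, so the table keeps two columns on every row.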

In [None]:
!ls data
!head data/wf.csv