# Lesson 8: File I/O

Jurre Hageman

Often, you want to read data into your program. When you start a Python session, your data is not yet present in collections such as strings, lists, tuples, and dictionaries. These collections live only in memory, not in a persistent state. To create such collections, you often start by reading data from persistent storage (files). In this lesson, we will deal with reading from files and writing to files. Python can open many different file types. This course concentrates on ASCII files (also called plain text files). A text file is a file in which each byte represents one character according to the ASCII code. There is no formatting such as bold or superscript. Remember that we already opened text files in lesson 2. Let's start with a short summary of that lesson.

The most basic file type is the text file or ASCII file. This is a file that you can open with a text editor to get readable text:

In [2]:
# you do not need to understand the code below yet.
import platform
os_type = platform.system()
if os_type == "Windows":
    !more file1.txt
else: # must be Unix-like, thus cat is probably installed. 
    !cat file1.txt

This is a text file.
If you open it with a text editor, you will be able to read the text.
End of message...


In the code above we used the command `more` (installed on Windows) or the command `cat` (installed on Unix-like systems) to read the content of the file. We can do that with Python too:

In [4]:
filename = "file1.txt"
file_object = open(filename)
print(file_object)
file_content = file_object.read()
print(file_content)

<_io.TextIOWrapper name='file1.txt' mode='r' encoding='UTF-8'>
This is a text file.
If you open it with a text editor, you will be able to read the text.
End of message...



- The first line stores the path to the file in the variable `filename`. Because the file is in the same directory as this notebook file, we only need to specify the name of the file.
- The `open` function returns a file object. The file content is not read yet.
- The file object is printed to show that it does not hold the content of the file yet, which saves memory.
- The `read` method of the file object is called to read the content of the file, which is returned as a multi-line string. This string is assigned to the variable `file_content`.
- The string is printed.
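
To make the steps above concrete, here is a minimal sketch showing that `read` returns the whole file as one string. It does not use the lesson files: it assumes a small demo file named `demo_read.txt` (a hypothetical name), which the code creates first in a temporary directory so that the example runs on its own.

```python
import os
import tempfile

# You do not need to understand these two lines yet: they create a
# temporary directory so the demo file does not clutter your work folder.
demo_dir = tempfile.mkdtemp()
filename = os.path.join(demo_dir, "demo_read.txt")  # hypothetical file name

# Create a small demo file.
demo_file = open(filename, "w")
demo_file.write("first line\nsecond line\n")
demo_file.close()

file_object = open(filename)
file_content = file_object.read()   # the whole file as one string
file_object.close()

print(type(file_content))
print(repr(file_content))           # repr shows the newline characters as \n
lines = file_content.splitlines()   # split the string into a list of lines
print(lines)
```

The `splitlines` method is one way to turn the single string into a list with one item per line.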


> It is better not to use / and \ to specify the path as Windows and Unix systems use them differently. In Informatics 2, you will learn how to deal with file paths in an OS-agnostic way.

## Reading files in streaming mode

While the previous method works, it is often not advised for large files. The `file_object.read()` method loads the entire file into memory at once. If you work with large files, it is better to work in streaming mode. Let's repeat the previous example in streaming mode:

In [7]:
filename = "file1.txt"
file_object = open(filename)
for line in file_object:
    print(line, end="")

This is a text file.
If you open it with a text editor, you will be able to read the text.
End of message...


Note that we use a for loop to read through the content of the file. You can only do this once. If you try to do it again, you will observe that the file object is exhausted:

In [29]:
for line in file_object:
    print(line, end="")
print("This is used to show that this cell is executed")

This is used to show that this cell is executed


So if you want to print the content of the file again, you need to create a new object using the open function:

In [14]:
filename = "file1.txt"
file_object = open(filename)
for line in file_object:
    print(line, end="")

This is a text file.
If you open it with a text editor, you will be able to read the text.
End of message...
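
As an aside that this lesson does not otherwise cover: instead of opening the file again, an exhausted file object can also be rewound with its `seek` method. The sketch below is only an illustration under that assumption. It uses a hypothetical demo file, `demo_rewind.txt`, created in a temporary directory.

```python
import os
import tempfile

# Temporary directory and hypothetical file name, for illustration only.
demo_dir = tempfile.mkdtemp()
filename = os.path.join(demo_dir, "demo_rewind.txt")

demo_file = open(filename, "w")
demo_file.write("line 1\nline 2\n")
demo_file.close()

file_object = open(filename)
first_pass = []
for line in file_object:
    first_pass.append(line)

second_pass = []
for line in file_object:      # the file object is exhausted: nothing is read
    second_pass.append(line)

file_object.seek(0)           # rewind to the start of the file
third_pass = []
for line in file_object:
    third_pass.append(line)
file_object.close()

print(len(first_pass), len(second_pass), len(third_pass))
```

Creating a new file object with `open`, as shown above, remains the simpler option.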


To show you the difference in memory usage, I will use some code that you do not need to understand yet, but it does show the amount of memory used. First, reading the content of the file using the `file_object.read()` method:

In [15]:
import sys
my_file = open('file1.txt')
content = my_file.read()
print(sys.getsizeof(content), 'bytes')

158 bytes


Now reading the same file in streaming mode:

In [16]:
my_file = open('file1.txt')
for line in my_file:
    print(sys.getsizeof(line), 'bytes')

70 bytes
119 bytes
67 bytes


As you can see, processing a file line by line keeps less data in memory at any one time. Because bioinformaticians often work with very large files, processing files line by line is often preferred.
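
A minimal sketch of what line-by-line processing can look like in practice: it counts the lines in a file and tracks the longest one, without ever storing the whole file. The demo file `demo_stream.txt` is a hypothetical name and is created by the code itself in a temporary directory.

```python
import os
import tempfile

# Temporary directory and hypothetical file name, for illustration only.
demo_dir = tempfile.mkdtemp()
filename = os.path.join(demo_dir, "demo_stream.txt")

# Write 1000 short lines to the demo file.
demo_file = open(filename, "w")
for number in range(1000):
    print("this is line", number, file=demo_file)
demo_file.close()

line_count = 0
longest = 0
file_object = open(filename)
for line in file_object:
    # Only the current line is held in the variable `line`.
    line_count += 1
    if len(line) > longest:
        longest = len(line)
file_object.close()

print(line_count, "lines, longest line has", longest, "characters")
```

Only two numbers are kept between iterations, so the memory needed does not grow with the size of the file.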

## csv files

Often, researchers use text files to store tabular data. If commas are used to separate the columns, such a file is a comma-separated values file, or csv file. Instead of commas, other column separators (tabs, semicolons, etc.) can be used.

You can open csv files in a text editor:  
![pic](pics/fig1.png)

You can open csv files in a spreadsheet:  
![pic](pics/fig2.png)

And (of course), you can open csv files using Python. Here, I will store the data in a dictionary:

In [21]:
aa = {}
filename = "file2.csv"
file_object = open(filename)
for line in file_object:
    line = line.strip() # strip the newlines
    data = line.split(",")
    full_name = data[0]
    single_letter = data[1]
    aa[full_name] = single_letter
print(aa)

{'alanine': 'A', 'arginine': 'R', 'asparagine': 'N', 'aspartic acid': 'D', 'cysteine': 'C', 'glutamic acid': 'E', 'glutamine': 'Q'}
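
The same approach works for other column separators. The sketch below assumes a tab-separated file; the file name `demo_amino_acids.tsv` and the variable `separator` are hypothetical and only serve the illustration. The demo file is created by the code itself.

```python
import os
import tempfile

# Temporary directory and hypothetical file name, for illustration only.
demo_dir = tempfile.mkdtemp()
filename = os.path.join(demo_dir, "demo_amino_acids.tsv")

# Create a small tab-separated demo file.
demo_file = open(filename, "w")
demo_file.write("alanine\tA\n")
demo_file.write("aspartic acid\tD\n")
demo_file.write("cysteine\tC\n")
demo_file.close()

separator = "\t"  # change this to "," or ";" for other file types
aa = {}
file_object = open(filename)
for line in file_object:
    line = line.strip("\n")        # strip only the newline, keep any tabs
    data = line.split(separator)
    aa[data[0]] = data[1]
file_object.close()
print(aa)
```

Note that `strip("\n")` is used here instead of `strip()`, because a plain `strip()` would also remove a tab at the end of a line and thereby hide an empty last column.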


## File modes

There are different modes a file object can be in:
- read or 'r'
- write or 'w'
- append or 'a'

There are more modes, but we will concentrate on these three.
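
The difference between these three can be sketched in one short example. The file name `demo_modes.txt` is hypothetical, and the file is created in a temporary directory so that the example runs on its own.

```python
import os
import tempfile

# Temporary directory and hypothetical file name, for illustration only.
demo_dir = tempfile.mkdtemp()
filename = os.path.join(demo_dir, "demo_modes.txt")

my_file = open(filename, "w")   # write mode: creates the file
my_file.write("first\n")
my_file.close()

my_file = open(filename, "w")   # write mode again: old content is replaced
my_file.write("second\n")
my_file.close()

my_file = open(filename, "a")   # append mode: new content is added at the end
my_file.write("third\n")
my_file.close()

my_file = open(filename, "r")   # read mode
content = my_file.read()
my_file.close()
print(content, end="")
```

The word `first` is gone because the second `open` in write mode replaced the content, while `third` was added after `second` by append mode.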

Read is default so you do not have to explicitly define it, although it does not hurt if you do:

In [23]:
my_file = open('file1.txt', 'r') # explicit in read mode
for line in my_file:
    print(line, end='')

This is a text file.
If you open it with a text editor, you will be able to read the text.
End of message...


If you only read a file, closing it is not very important. Python will close it for you when the script stops. It is, however, good practice to close your file after use, and it is **very important** when you write to files.

In [24]:
my_file = open('file1.txt', 'r') # explicit in read mode
for line in my_file:
    print(line, end='')
my_file.close() # explicitly close your files.  

This is a text file.
If you open it with a text editor, you will be able to read the text.
End of message...
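
Whether a file object is still open can be checked with its `closed` attribute, which this lesson does not otherwise use. A minimal sketch, with a hypothetical demo file `demo_close.txt` created in a temporary directory:

```python
import os
import tempfile

# Temporary directory and hypothetical file name, for illustration only.
demo_dir = tempfile.mkdtemp()
filename = os.path.join(demo_dir, "demo_close.txt")

my_file = open(filename, "w")
print("hello", file=my_file)
before_close = my_file.closed   # False: the file is still open
my_file.close()
after_close = my_file.closed    # True: the file is closed

print("closed before close():", before_close)
print("closed after close():", after_close)
```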


> In Informatics 2, you will learn to work with the `with` statement. It uses a context manager that closes the file automatically and can save you a lot of trouble! For now, you don't need it. Just remember that for reading text files, closing the file object is not critical, but for saving content to a file it is. It is nevertheless good practice to always explicitly close your file object.
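
As a preview only, this is roughly what the `with` statement looks like. It is a minimal sketch, not something you need for this lesson. The file name `demo_with.txt` is hypothetical, and the file is created in a temporary directory.

```python
import os
import tempfile

# Temporary directory and hypothetical file name, for illustration only.
demo_dir = tempfile.mkdtemp()
filename = os.path.join(demo_dir, "demo_with.txt")

with open(filename, "w") as my_file:
    print("hello", file=my_file)
# Leaving the indented block closes the file automatically.
closed_after_block = my_file.closed
print(closed_after_block)

with open(filename) as my_file:
    content = my_file.read()
print(content, end="")
```

No call to `close()` is needed: the file is closed as soon as the indented block ends.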

## Write data using the print function

You can write data to a file using the print function:

In [45]:
my_file = open("hello.txt", "w") # write mode
print("hello", file=my_file)
my_file.close() # close the file so the content is actually written to disk
print("This is used to show that this cell is executed")

This is used to show that this cell is executed


As you can see, the string `hello` is not printed to screen but it is written to the file hello.txt:

In [46]:
for i in open('hello.txt'):
    print(i, end='')

hello


Note that the file mode used is write mode. That means that the content of the file will be overwritten each time the code is executed:

In [49]:
my_file = open("hello.txt", "w") # write mode
print("bla bla", file=my_file) # different string is written to the file
my_file.close() # close the file so the content is actually written to disk

for i in open('hello.txt'):
    print(i, end='') # hello is replaced by bla bla

bla bla


The `file=` parameter of the print function is often used to write messages to a log file. It is useful to write to a log file in append mode:

In [64]:
!rm log.txt 

In [65]:
seq = "gatc"
log = open("log.txt", "a")
print("sequence converted to:", seq, file=log)
log.close()

Read the file:

In [66]:
for line in open("log.txt"):
    print(line, end="")

sequence converted to: gatc


Add a new log entry:

In [67]:
seq = "cccc"
log = open("log.txt", "a")
print("sequence converted to:", seq, file=log)
log.close()

Read the file again:

In [68]:
for line in open("log.txt"):
    print(line, end="")

sequence converted to: gatc
sequence converted to: cccc


## Write data using file_object.write()

As an alternative to the print function, you can write to a file using the `file_object.write()` method. Here is an example:

In [72]:
sequences = ["GAATC", "CAACC", "GAGGG", "TTTTT", "AAAA"]
seq_file_obj = open("seq.txt", 'w')
for seq in sequences:
    seq_file_obj.write(seq)
seq_file_obj.close()
print("done")

done


And read the content of the file:

In [73]:
for line in open("seq.txt"):
    print(line, end="")

GAATCCAACCGAGGGTTTTTAAAA


Oops, that's not what we wanted. Of course, we want a newline "\n" after each sequence. The print function adds a newline by default, but the `write` method does not. We can add one ourselves:

In [74]:
sequences = ["GAATC", "CAACC", "GAGGG", "TTTTT", "AAAA"]
seq_file_obj = open("seq.txt", 'w')
for seq in sequences:
    seq_file_obj.write(seq + "\n")
seq_file_obj.close()
print("done")

done


Read the file:

In [76]:
for line in open("seq.txt"):
    print(line, end="")

GAATC
CAACC
GAGGG
TTTTT
AAAA
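
To close the loop, the file can be read back into a list so that the data is again available as a collection in memory. This is a minimal sketch. It writes to a hypothetical file `demo_seq.txt` in a temporary directory instead of `seq.txt`, so that it runs on its own.

```python
import os
import tempfile

# Temporary directory and hypothetical file name, for illustration only.
demo_dir = tempfile.mkdtemp()
filename = os.path.join(demo_dir, "demo_seq.txt")

sequences = ["GAATC", "CAACC", "GAGGG", "TTTTT", "AAAA"]
seq_file_obj = open(filename, "w")
for seq in sequences:
    seq_file_obj.write(seq + "\n")
seq_file_obj.close()

sequences_from_file = []
seq_file_obj = open(filename)
for line in seq_file_obj:
    sequences_from_file.append(line.strip())  # strip the newline
seq_file_obj.close()

print(sequences_from_file)
print(sequences_from_file == sequences)
```

Without `strip()`, every item in the list would still end with the newline character that was written to the file.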


To wrap it up:  
- The `open` function creates a file object
- The content of the file is not loaded when the file is opened, in order to save memory
- You can read the file using the `file_object.read()` method
- For large files it is better to iterate through the file object and to process the file line by line
- You can close the file object by calling the `file_object.close()` method
- Closing files is very important when you write to files
- File objects have multiple modes. The most important are read ('r'), write ('w') and append ('a')
- You can save content to the file object using the `print` function with the `file=` argument or using the `file_object.write()` method.

The end...