# **Files**


In [None]:
# run this cell to show a video, use slider to resize it, type Esc-o to hide it
from IPython.display import Video, clear_output; from ipywidgets import interactive, IntSlider
def _play(resize): display(Video(filename="media/ECS780P_Files_topicSummary.mp4",data="",width=resize))
interactive(_play, resize=IntSlider(min=150, max=900, step=50, value=600, continuous_update=False, readout=False))

# Reading and writing files

Entering input via the keyboard and reading the output from a console obviously isn't a very convenient way of dealing with large amounts of data. Most of the time, we will want to operate on files (e.g. PDB and FASTA files in the context of bioinformatics) that we download from some online database **(*)** and store the results of our computation on the disk for further processing. Dealing with files is luckily straightforward in Python.

**(*)** We might also be using data in CSV files, in a general context. 

**Note** that since the files we will work with sit in the same directory as the notebook, you may need to log in to the server in order to edit them.


# Text *versus* Binary files

There are essentially two types of data files: *text* and *binary*. 

**Text files** can be opened and modified with a plain text editor such as *gedit*, *emacs* or *notepad*. They are not necessarily written in plain English: a Python program, an HTML page, a CSV file or a PDB file are all text files. Text files are generally somewhat human-readable and are portable across different operating systems and editors, (usually) with very minor changes. The disadvantage is that they usually take up more space on the disk than an equivalent binary representation.

**Binary files** contain data in the internal machine representation of the data. They are less portable and are generally used either to talk directly to the machine, or to store large amounts of data efficiently, or to protect intellectual property.
Python bytecode (.pyc) files, compressed (.zip) files and Microsoft Word (.doc) files are all examples of binary files. An attempt at opening them in a text editor will only show 'gibberish' on the screen.

Happily most of the file types used in bioinformatics (and in computing in general) are text files with open format specifications, so we will not worry about binary files here. It is important however to remember, in your future professional life, that sometimes what you really need is not a bigger disk, but only a more compact data representation.
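
To make the point about compact representations concrete, here is a minimal sketch that stores the same numbers once as text and once as packed binary, then compares the file sizes. The data, the file names and the choice of a 4-byte integer encoding are assumptions made for illustration only.

```python
import os
import struct
import tempfile

# Hypothetical data: 1000 six-digit integers
values = list(range(100000, 101000))

with tempfile.TemporaryDirectory() as tmp:
    text_path = os.path.join(tmp, "values.txt")
    binary_path = os.path.join(tmp, "values.bin")

    # Text: one human-readable number per line
    with open(text_path, "w") as out:
        for v in values:
            out.write(str(v) + "\n")

    # Binary: each number packed as a 4-byte little-endian unsigned integer
    with open(binary_path, "wb") as out:
        for v in values:
            out.write(struct.pack("<I", v))

    text_size = os.path.getsize(text_path)
    binary_size = os.path.getsize(binary_path)

print("text file:  ", text_size, "bytes")
print("binary file:", binary_size, "bytes")
```

The binary file needs exactly 4 bytes per number, while the text file needs one byte per digit plus the line ending. The price is that the binary file can no longer be read in a text editor.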


# Reading files

We will now try to read the following FASTA file (**note** this will not work unless the file actually is in the same directory as this notebook):

In [None]:
# Ignore, this is just an IPython trick
# that creates the link below
from IPython.display import FileLink
FileLink('P04637.fas')

Reading this file in Python first requires a call to **open**, which returns a file handle:

In [None]:
FASTA = open("P04637.fas", "r")
print(FASTA)

The "r" indicates we are opening this file for *reading*. The handle that's returned is a convenience object used by Python to keep track of all data related to the file (including location, type, position, etc). Just treat it as you would any other Python object.
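
As a small illustration of what the handle keeps track of, the sketch below inspects a few of its attributes. It uses a throwaway temporary file (an assumption, so that the example runs on its own) rather than the FASTA file.

```python
import os
import tempfile

# Create a small throwaway file so the example is self-contained
tmp = tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False)
tmp.write("first line\nsecond line\n")
tmp.close()

handle = open(tmp.name, "r")
print(handle.name)     # path the handle refers to
print(handle.mode)     # 'r'
print(handle.closed)   # False
start = handle.tell()  # current position: 0, nothing has been read yet
first = handle.readline()
handle.close()
print(handle.closed)   # True

os.remove(tmp.name)    # tidy up the throwaway file
```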

### Using *readlines()*

An easy way to read the whole file is to use *readlines()*:

In [None]:
# this won't work twice in a row, see below
everything = FASTA.readlines()
print(everything)

As you can see, *readlines()* returns a list of strings, each corresponding to a line in the file. This exhausts the contents of the file. All that remains to do now is to close the file to free the associated system resources:

In [None]:
FASTA.close()

Note that you cannot run *readlines()* on the same file twice without "rewinding" it (e.g. by closing it and reopening).
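
Besides closing and reopening, a file opened for reading can also be rewound with *seek(0)*. The following is a minimal sketch on a throwaway temporary file (an assumption, used only to keep the example self-contained):

```python
import os
import tempfile

# Create a small throwaway file so the example is self-contained
tmp = tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False)
tmp.write("line 1\nline 2\n")
tmp.close()

handle = open(tmp.name, "r")
first_pass = handle.readlines()   # reads everything
second_pass = handle.readlines()  # nothing left to read: an empty list
handle.seek(0)                    # rewind to the start of the file
third_pass = handle.readlines()   # everything again
handle.close()
os.remove(tmp.name)

print(first_pass)
print(second_pass)
print(third_pass)
```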

### Using instead *readline()*

Using *readlines()* is less than ideal as it slurps up the whole file in one go, which is somewhat inconvenient as the file may contain different types of data (in this case the header and the protein proper) or may be too large to fit entirely in memory. In general, it is better to process it one line at a time, using repeated calls to *readline()*. In this particular example, that also gives us the chance to assemble the protein into a single string. Here is how it works:

In [None]:
FASTA = open("P04637.fas", "r")
header = FASTA.readline()
protein = ""
while True:
    oneLine = FASTA.readline()
    if oneLine == "": break
    protein += oneLine.rstrip()
FASTA.close()

# Done. This is just pretty-printing
(code, name) = header.split('|')
print("Accession code:")
print(code)
print("\nName:")
print(name)
print("Protein:")
print(protein)
print("\nNumber of residues:")
print(len(protein))


### Using a "for" loop:

There is a more "Pythonic" way of doing this that uses the fact that a file is an *iterable* and therefore can be read directly using a *for* loop:

In [None]:
FASTA = open("P04637.fas", "r")
header = FASTA.readline()
protein = ""
for oneLine in FASTA: # couldn't be easier!
    protein += oneLine.rstrip()
FASTA.close()

# Done. This is just pretty-printing
(code, name) = header.split('|')
print("Accession code:")
print(code)
print("\nName:")
print(name)
print("Protein:")
print(protein)
print("\nNumber of residues:")
print(len(protein))
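
In the examples above we close the file by hand. A common alternative is the *with* statement, which closes the file automatically when the block ends, even if an error occurs inside it. Here is a minimal sketch; the small FASTA-like file it creates is a hypothetical stand-in, not the real P04637 entry.

```python
import os
import tempfile

# Hypothetical stand-in for a FASTA file, so the example is self-contained
tmp = tempfile.NamedTemporaryFile("w", suffix=".fas", delete=False)
tmp.write("PXXXX | Example protein\nMEEPQSDPSV\nEPPLSQETFS\n")
tmp.close()

with open(tmp.name, "r") as fasta:
    header = fasta.readline().rstrip()
    # join the remaining lines into a single string
    protein = "".join(line.rstrip() for line in fasta)

# No explicit close() needed: the file is already closed here
print(fasta.closed)
print(header)
print(protein)

os.remove(tmp.name)
```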

In [None]:
# run this cell to show a video, use slider to resize it, type Esc-o to hide it
from IPython.display import Video, clear_output; from ipywidgets import interactive, IntSlider
def _play(resize): display(Video(filename="media/ECS780P_Files_ReadingFromFiles.mp4",data="",width=resize))
interactive(_play, resize=IntSlider(min=150, max=900, step=50, value=600, continuous_update=False, readout=False))

# Writing files

Writing files is not much more of a hassle than reading them. In fact, the main steps are the same:
1. open a file
2. write the content
3. close it

The only differences are that the file must be opened in *write mode*, and that the *file.write()* method must be used to actually write data to it.

**NOTE**: Opening a file for writing will erase its previous content (if there was any). It is, however, possible to open a file for appending instead, by using mode `"a"` rather than `"w"`.

Here is an example of writing data to a file:

In [None]:
OUTF = open("greetings.txt", "wt") # "wt" = write, text mode (same as "w")
# strings can be written directly
OUTF.write("Hello World!\n") # \n means newline
value = ('The Answer', 42) # other stuff needs to be converted to a string
OUTF.write(str(value))
OUTF.close() # makes sure the buffer gets flushed to the disk

In [None]:
FileLink('greetings.txt')
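
The difference between writing and appending, mentioned in the note above, can be sketched as follows. The file name is hypothetical and the file lives in a temporary directory so that nothing in the notebook folder is touched.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "log.txt")  # hypothetical file name

    with open(path, "w") as out:   # "w" creates the file
        out.write("first\n")

    with open(path, "a") as out:   # "a" keeps the content and adds to the end
        out.write("second\n")
    with open(path, "r") as inp:
        after_append = inp.read()

    with open(path, "w") as out:   # "w" again: the previous content is erased
        out.write("third\n")
    with open(path, "r") as inp:
        after_rewrite = inp.read()

print(repr(after_append))
print(repr(after_rewrite))
```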

So for instance, writing your candidate protein out in FASTA format can be done this way:

In [None]:
# Your candidate protein
accession = "PXXXX"
description = "My candidate protein - Homo programmaticus (Programmer)" 
sequence = """QIKDLLVSSSTDLDTTLVLVNAIYFKGMWKTAFNAEDTREMPFHVTKQESKPVQMMCMNNSFNVATLPAEKMK
    ILELPFASGDLSMLVLLPDEVSDLERIEKTINFEKLTEWTNPNTMEKRRVKVYLPQMKIEEKYNLTS
    VLMALGMTDLFIPSANLTGISSAESLKISQAVHGAFMELSEDGIEMAGSTGVIEDIKHSPESEQFRADHP
	FLFLIKHNPTNTIVYFGRYWSP"""
# get rid of tabs, newlines and spaces in the above string
sequence = sequence.replace(' ','')
sequence = sequence.replace('\t', '')
sequence = sequence.replace('\n', '')

# Ok, now let's write it
OUTF = open(accession + ".fas", "w")
# first the header
header = accession + " | " + description+"\n"
OUTF.write(header)
# output in 60 char lines for convenience
linew = 60
pos = 0
while sequence[pos:pos+linew] != '':
    OUTF.write(sequence[pos:pos+linew]+"\n")
    pos += linew
OUTF.close() # Done!


In [None]:
FileLink(accession + ".fas")
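
The fixed-width wrapping used above can also be written with a *for* loop over *range()*, which avoids keeping track of the position by hand. This is only a sketch of one possible alternative; the helper name `wrap_sequence` and the test sequence are made up for illustration.

```python
def wrap_sequence(sequence, width=60):
    """Return the sequence split into chunks of at most `width` characters."""
    return [sequence[pos:pos + width] for pos in range(0, len(sequence), width)]

# Hypothetical 140-residue sequence
example = "ACDEFGHIKLMNPQRSTVWY" * 7
chunks = wrap_sequence(example, width=60)
print([len(c) for c in chunks])
```

Each chunk can then be written out with `OUTF.write(chunk + "\n")`, exactly as in the *while* loop version.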

# CSV Files

The CSV (Comma-Separated Values) file format is a popular text file format that lists each record on a separate line. Data fields for the same entry are separated by commas (or occasionally semicolons or tabs). Many popular packages, Excel among them, can output data as CSV files. Reading and writing CSV files in Python is easy, either directly or via the *csv* module.

Example - the file *marksheet.csv* contains the following text:

```
Name, Surname, Mark
John, Smith, 50
Anne, Larsson, 65
Emiliano, Zapata, 95
Donald, Duck, 40
```

When I link to this file, the browser is likely to suggest opening it as a spreadsheet:

In [None]:
FileLink('marksheet.csv')

We can, however, treat this as a normal text file. Suppose that we want to compute the average mark:

In [None]:
FILE = open("marksheet.csv", "r")
FILE.readline() # skip header
total = 0.0
students = 0
for line in FILE:
    entries = line.split(',')
    total += float(entries[2])
    students += 1
FILE.close()

print("Average: ", total/students)

The ```csv``` module offers a slightly more convenient way of accessing the data:

In [None]:
import csv # import the csv module
FILE = open("marksheet.csv", "r")

total = 0.0
students = 0
marksheet = csv.reader(FILE)
next(marksheet) # skip first line
for line in marksheet: # already split for us
    total += float(line[2])
    students += 1
FILE.close()

print("Average: ", total/students)

For such a simple file it is hardly worth the trouble; however, the ```csv``` module can handle several variants of the format (essentially different separators and quoting conventions) and can even try to guess the format of a given file automatically. See the [online help](https://docs.python.org/3/library/csv.html) for details.
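
As an illustration of handling a different separator, here is a minimal sketch that reads semicolon-separated data. The data is a made-up variant of the marksheet, held in a string via `io.StringIO` instead of a real file so that the example runs on its own.

```python
import csv
import io

# Hypothetical semicolon-separated data, with a space after each separator
text = "Name; Surname; Mark\nJohn; Smith; 50\nAnne; Larsson; 65\n"

reader = csv.reader(io.StringIO(text), delimiter=";", skipinitialspace=True)
header = next(reader)  # first row holds the column names
rows = list(reader)    # remaining rows, each already split into fields

print(header)
print(rows)
```

The `skipinitialspace=True` option drops the space that follows each separator, so the fields come back without leading blanks.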

The ```csv``` module also provides a ```writer``` object. However, remember that writing a ```csv``` file is essentially only a matter of placing the commas in the right places, at least as long as no field itself contains a comma, a quote or a newline.
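
Here is a minimal sketch of the ```writer``` object in action. The rows are made up, and one surname deliberately contains a comma to show the case where placing commas by hand would go wrong. The output goes to an in-memory `io.StringIO` buffer rather than a real file, to keep the example self-contained.

```python
import csv
import io

rows = [
    ["Name", "Surname", "Mark"],
    ["John", "Smith", 50],
    ["Anne", "Larsson, Jr.", 65],  # hypothetical field containing a comma
]

buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerows(rows)  # fields that need it are quoted automatically
written = buffer.getvalue()
print(written)

# Reading the text back recovers the original fields (as strings)
read_back = list(csv.reader(io.StringIO(written)))
print(read_back[2])
```

When writing to a real file instead of a buffer, the ```csv``` documentation recommends opening it with `open(filename, "w", newline="")`.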

In [None]:
# run this cell to show a video, use slider to resize it, type Esc-o to hide it
from IPython.display import Video, clear_output; from ipywidgets import interactive, IntSlider
def _play(resize): display(Video(filename="media/ECS780P_Files_WritingToFiles.mp4",data="",width=resize))
interactive(_play, resize=IntSlider(min=150, max=900, step=50, value=600, continuous_update=False, readout=False))