# Files

On computers, persistent information is stored as files on a hard disk. A file is simply a sequence of bytes (a byte is 8 bits) that the computer knows how to interpret. Files can be, but need not be, human-readable, i.e., written in a format that can be readily understood by looking at their contents.

Textual information is often stored in ASCII format, and can be read by humans using any editor. Many data files have such an encoding, e.g., FASTA files and CSV files. Python has a number of functions and methods to read such files, and process the information they contain.
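To make the link between bytes and text concrete, the small sketch below encodes a string as ASCII and inspects the resulting bytes. The string `'ACGT\n'` is only an illustrative value and is not taken from the data files used in this notebook.

```python
# a short piece of text: four characters followed by a newline
text = 'ACGT\n'

# encode the text as ASCII: each character becomes exactly one byte
raw = text.encode('ascii')
print(raw)          # b'ACGT\n'
print(list(raw))    # [65, 67, 71, 84, 10]

# decoding the bytes gives back the original text
decoded = raw.decode('ascii')
print(decoded == text)   # True
```

Each number in the list is the value of one byte; an editor shows the corresponding characters instead.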

## Reading from files

The following code fragment illustrates how to open a file, so that we can read its content. Next, we loop over it line by line, and print each.

In [3]:
file_name = 'Data/series.txt'
with open(file_name, 'r') as series_file:
    for line in series_file:
        print(line, end='')

13.4,17.5,16.3,15.9,16.1
4.1,7.3,9.4,13.8,21.17,35.2,49.1
23.7,25.4

Our example file is in text format (we can read and interpret it), and it has three lines. To illustrate, we will also print the line number (0-based).

In [4]:
file_name = 'Data/series.txt'
with open(file_name, 'r') as series_file:
    for line_nr, line in enumerate(series_file):
        print(f'{line_nr:d}: {line}', end='')

0: 13.4,17.5,16.3,15.9,16.1
1: 4.1,7.3,9.4,13.8,21.17,35.2,49.1
2: 23.7,25.4

Each line represents a number of floating point values, separated by `,`. We will now compute the average of the numbers on a line, and print it.

In [5]:
file_name = 'Data/series.txt'
with open(file_name, 'r') as series_file:
    for line in series_file:
        data_strs = line.rstrip().split(',')
        data = [float(data_str) for data_str in data_strs]
        average = sum(data)/len(data)
        print(f'{average:.2f}')

15.84
20.01
24.55


The approach above is fairly typical:
  * open a file,
  * read it line by line
    * split the line into data items
    * convert the strings representing data items to the desired data format, if necessary,
    * compute something using the data.

Here, the `open` function is called with two arguments: a file name, and the mode. In the example above we want to read a file, so we specify `'r'`. The result of the `open` function is a file handle, which is assigned to the variable `series_file`. On this file handle, we can do read operations, iterating over all its lines in this case. The file remains open in the body of the `with` statement. When the last statement in the `with` body has been executed, the file is automatically closed.
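The automatic closing can be checked with the `closed` attribute of the file handle. The sketch below is only an illustration: it first creates a throw-away file in a temporary directory (the directory and the file content are assumptions made to keep the example self-contained), and then reads it back.

```python
import os
import tempfile

# create a small throw-away file so that the example is self-contained
with tempfile.TemporaryDirectory() as tmp_dir:
    file_name = os.path.join(tmp_dir, 'series.txt')
    with open(file_name, 'w') as series_file:
        print('13.4,17.5', file=series_file)

    with open(file_name, 'r') as series_file:
        # inside the with body, the file is open
        open_inside = not series_file.closed
        lines = [line for line in series_file]

    # after the with body, the file has been closed automatically
    closed_after = series_file.closed

print(open_inside, closed_after)   # True True
print(lines)                       # ['13.4,17.5\n']
```

Note that the variable `series_file` still exists after the `with` statement, but read operations on it would fail since the file is closed.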

#### Your turn now: global average

Replace `____` in the code below so that it will print the average of all numbers in the `Data/series.txt` file.

In [None]:
file_name = 'Data/series.txt'
with open(file_name, 'r') as series_file:
    ____
    for line in series_file:
        data_strs = line.rstrip().split(',')
        data = [float(data_str) for data_str in data_strs]
        ____
print(f'{average:.2f}')

#### Your turn now: show file function

Convert the code fragment we used above to print the entire content of a text file into a function `show` that takes the file name to show as an argument.

#### Your turn now: to `'rU'` or not to `'rU'`?

The book recommends the use of `'rU'` as _the_ mode to use when dealing with text files. However, in the code above, we simply used `'r'`. Check the documentation of the `open` function and explain why we don't use `'rU'`. Try to find out why `'r'` works as well as `'rU'` with current Python implementations.

## Writing to files

To permanently store information, we write it to a file. First, we have to open the file, and then print the text we want to store to it. The code fragment below will write the integers from 1 to 5, their squares and their cubes to a file, one line for each integer.

In [7]:
file_name = 'Data/to_remove.txt'
with open(file_name, 'w') as math_file:
    for i in range(1, 6):
        print(f'{i};{i**2};{i**3}', file=math_file)

Verify that the file `Data/to_remove.txt` contains what you expect.

The `open` function call and the `with` statement are very similar to the previous example. However, note that the mode is `'w'`, for write, since we want to store data in the file. We used the `print` function we're familiar with, directing its output to the file handle `math_file`, which is passed as the optional argument `file`.

Note that when opening a file in write mode (`'w'`), an existing file will be overwritten, and its original contents lost. The `'x'` mode can be used to avoid this: if the file already exists, an error will be reported when the file is opened in that mode.
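The difference between the two modes can be sketched as follows. To avoid touching real data, this example writes to a temporary directory; the directory and the text written to the file are illustrative assumptions. Opening an existing file in `'x'` mode raises a `FileExistsError`.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    file_name = os.path.join(tmp_dir, 'to_remove.txt')

    # 'w' creates the file, or silently overwrites an existing one
    with open(file_name, 'w') as math_file:
        print('first version', file=math_file)
    with open(file_name, 'w') as math_file:
        print('second version', file=math_file)
    with open(file_name, 'r') as math_file:
        content = math_file.read()

    # 'x' refuses to open a file that already exists
    try:
        with open(file_name, 'x') as math_file:
            print('third version', file=math_file)
        refused = False
    except FileExistsError:
        refused = True

print(content, end='')   # second version
print(refused)           # True
```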

## File formats

### CSV files

Although we have already read from and written to files in column format, handling such files correctly may be somewhat more involved than one might think at first sight. The Python standard library contains an excellent module to deal with this file format, CSV (Comma Separated Values), unsurprisingly called `csv`.
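One situation where simply splitting on `,` goes wrong is a field that itself contains a comma, and is therefore quoted. The sketch below compares naive splitting with `csv.reader`. The CSV text is made up for this illustration, and an in-memory `io.StringIO` object stands in for a real file.

```python
import csv
import io

# in-memory stand-in for a CSV file; the second field of the data row
# contains a comma, so it is quoted
csv_text = 'id,description,value\nseq_1,"short, boring",3.5\n'

# naive splitting breaks the quoted field into two pieces
naive = csv_text.splitlines()[1].split(',')
print(naive)    # ['seq_1', '"short', ' boring"', '3.5']

# csv.reader understands the quoting rules
with io.StringIO(csv_text) as csv_file:
    rows = list(csv.reader(csv_file))
print(rows[1])  # ['seq_1', 'short, boring', '3.5']
```

When reading a real file with the `csv` module, its documentation recommends opening the file with `newline=''`.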

### FASTA files

The BioPython library contains functionality to read and write sequence data from and to files in FASTA format. It is definitely recommended to use this library, and we will talk about it extensively at a later stage. However, for the sake of illustration, we will give a brief example of how to read a FASTA file using pure Python.

Our example file contains 5 protein sequence records. Each record has an identifier and some (boring) comment on the first line. The subsequent line(s) contain the actual sequence (note that a sequence can span multiple lines). We want to read it into a `dict` where the keys are the sequence identifiers, and the values are the sequences. We will ignore the comments for now.

If a line starts with a `>`, that is the start of a new record, and the `>` character is immediately followed by the sequence identifier. So if we find such a line, we know two things:
  1. the sequence of the previous record was completely read, and we have all information, and
  1. a new record just started, and its identifier is on this line.

Obviously, both observations hold for all records, except the first one. When we start, the `dict` should be empty, and we've not yet read any sequence data, so that tells us how to initialize the corresponding variables.

In [6]:
file_name = 'Data/protein_seqs.fasta'
sequences = dict()
sequence = ''
with open(file_name, 'r') as fasta_file:
    for line in fasta_file:
        if line.startswith('>'):    # this is the start of a new record
            # unless this is the first sequence record, we've been
            # adding symbols to sequence, and it is not the empty string
            if sequence:
                sequences[seq_id] = sequence
                # we set sequence to empty for the next record
                sequence = ''
            # we will ignore everything but the sequence ID,
            # so we only want a single split
            seq_id, _ = line.split(maxsplit=1)
            # the line starts with >, immediately followed by the ID,
            # so we get rid of >
            seq_id = seq_id[1:]
        else:                       # this is part of the current record
            sequence += line.strip()
# when we've read the last line, we should save the record data
if sequence:
    sequences[seq_id] = sequence

In [5]:
sequences

{'seq_1906_01': 'TKVPCSLCM',
 'seq_1906_02': 'HQRKLSFM',
 'seq_1906_03': 'PRCEWHCEVCGPGINGLPIFRSMEC',
 'seq_1906_04': 'VCGPGINGLPIFRSMECHNSYPETKVPCSL',
 'seq_1906_05': 'NSYPE'}

The result is what we would expect.

#### Your turn now: FASTA read function

Turn the code above into a function that takes a file name as an argument, and returns the `dict` as a result.

#### Your turn now: join versus +

The implementation of the function for reading FASTA files in the book uses a `list` to store the strings that are part of a sequence, and the `join` method to concatenate them. It is noted that this can be faster. Why would that be the case?

#### Your turn now: multiple FASTA files

Use the function above to write a function that takes a list of FASTA file names, and returns a `dict` that contains the sequences in all those files.