In [1]:
from __future__ import division, print_function, unicode_literals
import unittest

# Files

## Reading files

Until now, all of the data that we've worked on has had to be typed directly into the Python program we're using. While this is possible for small datasets, it would quickly become painful for anything that we actually would need the power of Python to analyze.

Thankfully, it's easy to open files and read their contents into a string in Python. There's an example file in the directory "assets" called "small.txt". Let's open it and read its contents into a variable:

In [2]:
with open('assets/small.txt', 'r') as small_file:
    small_data = small_file.read() # small_file is a file object

print(small_data)

Hey there! This is the first file you've opened in python
Files generally consist of lines of text
Each line can be as long as you want! This one is 64 characters!



Let's discuss all the parts of the first line, because this is the boilerplate you will follow to open files in Python pretty much every time.

The open() function opens a "connection" to a file on your hard drive and returns a file object that your program can read from or write to.

It takes two arguments: the first is a path to the file, and the second is a string that specifies a "mode" in which the file is opened. You can find a full list of modes at https://docs.python.org/3/library/functions.html#open, but for our purposes, we'll worry about 'r', which means read, and 'w', which means write. 

The simplest way to call open would be to do the following:

```python
small_file = open('assets/small.txt', 'r')
# Do stuff with the file, read contents, etc
small_file.close()
```

We don't write the file-opening code this way because it forces us to remember to explicitly close the file when we're done with it. Closing the file is necessary: if we forget to close a file that we've written to, some of the data may never actually be saved to disk, which can leave the file incomplete or corrupted.

This seems easy enough - but it's better not to risk forgetting this close command. Maybe more importantly, if the program crashes while you're handling the contents of the file, then it will never reach the close command and never close the connection.

When you write

```python
with open('assets/small.txt', 'r') as small_file:
    # Do stuff with the file, read contents, etc
    pass
```

this is equivalent to running the open command and assigning its result to the variable after "as". The big advantage of this method of opening a file is that the file is automatically closed when we leave the indented block after the with statement. This happens even if a statement in the with block causes the program to crash.
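
To make the automatic closing concrete, here is a minimal sketch. It writes a throwaway file into a temporary directory (the name demo.txt is made up for this illustration, so the example does not depend on the assets folder) and then checks the file object's `closed` attribute after leaving the with block, both normally and after an error.

```python
import os
import tempfile

# Create a throwaway file (hypothetical name) so the example is self-contained.
demo_path = os.path.join(tempfile.mkdtemp(), 'demo.txt')
with open(demo_path, 'w') as demo_file:
    demo_file.write('hello\n')

# Normal case: the file is closed as soon as we leave the with block.
with open(demo_path, 'r') as demo_file:
    first_line = demo_file.readline()
closed_normally = demo_file.closed
print(closed_normally)  # True

# Error case: the file is closed even though the block raised an error.
try:
    with open(demo_path, 'r') as demo_file:
        int('not a number')  # raises ValueError
except ValueError:
    pass
closed_after_error = demo_file.closed
print(closed_after_error)  # True
```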

## File methods

Like lists, strings, dictionaries, and other objects, files in Python also have a set of associated methods. The two most useful methods that you will encounter are 

1. .read() - which reads the entire contents of the file into a string (including the \n newline characters)
2. .readlines() - which reads each line of the file into a separate string, and returns a list of the strings

Let's see readlines in action on the same file

In [12]:
with open('assets/small.txt', 'r') as small_file:
    small_data_lines = small_file.readlines() # small_file is a file object

print(small_data_lines)

["Hey there! This is the first file you've opened in python\n", 'Files generally consist of lines of text\n', 'Each line can be as long as you want! This one is 64 characters!\n']


As you may have noticed, each of these lines ends with a newline character (represented in Python as "\n"). If you want to get rid of the newline at the end of each line, there's a useful string method, .strip(), which removes whitespace (including newlines) from both ends of a string. Let's copy these lines into a new list, stripping the newline character from each one:

In [13]:
stripped_lines = []
for line in small_data_lines:
    stripped_lines.append(line.strip())
stripped_lines

["Hey there! This is the first file you've opened in python",
 'Files generally consist of lines of text',
 'Each line can be as long as you want! This one is 64 characters!']
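
One detail worth knowing: called with no arguments, .strip() removes all leading and trailing whitespace, not only the newline. If a line's indentation or trailing spaces matter, .rstrip('\n') removes just the newline. A small sketch with a made-up line:

```python
line = '  indented text  \n'  # made-up example line

both_ends = line.strip()          # removes spaces and the newline from both ends
newline_only = line.rstrip('\n')  # removes only the trailing newline

print(repr(both_ends))     # 'indented text'
print(repr(newline_only))  # '  indented text  '
```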

## Iterating over files

.read() and .readlines() are useful methods, but they have a major disadvantage: both load the whole file into memory at once. Often you might be working with datasets that take up multiple gigabytes on the hard drive, so trying to load the entire file at once would put a significant strain on your computer.

Thankfully, like lists, dictionaries, and many other objects in Python, files can be used in a *for* loop, which visits one line at a time. If whatever you want to do with the file does not require the full dataset at once, only the current line needs to be held in memory, so your program will use far less memory on large datasets.

To see an example of iterating over a file, let's write a program to print only lines from our file that contain an exclamation mark:

In [14]:
with open('assets/small.txt', 'r') as small_file:
    for line in small_file:
        if '!' in line:
            print(line.strip())

Hey there! This is the first file you've opened in python
Each line can be as long as you want! This one is 64 characters!
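
The same pattern works for summarizing a file without ever holding all of it in memory. The sketch below first generates a throwaway file (the name many_lines.txt and its contents are invented for this example), then counts its lines and tracks the longest one while reading a single line at a time.

```python
import os
import tempfile

# Build a throwaway file with 1000 lines (hypothetical name and contents).
demo_path = os.path.join(tempfile.mkdtemp(), 'many_lines.txt')
with open(demo_path, 'w') as f:
    for i in range(1000):
        f.write('line number ' + str(i) + '\n')

line_count = 0
longest = 0
with open(demo_path, 'r') as f:
    for line in f:  # only one line is in memory at a time
        line_count += 1
        longest = max(longest, len(line.strip()))

print(line_count)  # 1000
print(longest)     # 15
```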


## Writing to files

In addition to reading from files, we can also write text to new files. We do this in a similar way to reading, but instead of using 'r' as the second argument to open(), we use 'w'. Let's see a quick example of writing to a file, and then reading its contents back:

In [15]:
with open('assets/no_new_lines.txt', 'w') as file_1:
    file_1.write("Here's a line")
    file_1.write("Here's another line")

with open('assets/no_new_lines.txt', 'r') as file_1:
    print(file_1.read())

Here's a lineHere's another line


Notice anything wrong? Although we wrote to the file twice, we only have one line in the file. This is because we did not append newline "\n" characters to the end of the strings that we wrote to the file. Thus, Python just added our second set of characters directly after the first.

Here's the same example with newlines

In [16]:
with open('assets/no_new_lines.txt', 'w') as file_1:
    file_1.write("Here's a line\n")
    file_1.write("Here's another line\n")

with open('assets/no_new_lines.txt', 'r') as file_1:
    print(file_1.read())

Here's a line
Here's another line



Much better!
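
A common variation is writing out a list of strings, one per line. Here is a minimal sketch, assuming a made-up list and a throwaway file in a temporary directory: it adds the newline on every write and then reads the lines back to confirm that nothing was lost.

```python
import os
import tempfile

lines = ['first', 'second', 'third']  # made-up data

# Hypothetical output file in a temporary directory.
out_path = os.path.join(tempfile.mkdtemp(), 'lines.txt')
with open(out_path, 'w') as out_file:
    for line in lines:
        out_file.write(line + '\n')  # add the newline ourselves

# Read the file back and strip the newlines again.
with open(out_path, 'r') as in_file:
    round_trip = [line.strip() for line in in_file]

print(round_trip == lines)  # True
```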

## More complex example: parsing a FASTA file

FASTA is a file format for storing nucleotide or peptide sequences. The format is quite simple: each line is either a descriptor line (the name of a sequence) or part of a sequence. All descriptor lines begin with a > character. Any lines between descriptors are considered to be part of the sequence named by the preceding descriptor. Blank lines are skipped. An example is given below:
```
>dna_sequence_one_line
ACTGGTTGGTGTAGGTGCCGTGTGCACACGTGGTGCACGTCACGGGCGTGACCA

>dna_sequence_two_lines
ACTGGTTGGTGTAGGTGCCGTGTGCGAATTCGGTGCACGTCACGGGCGTGACCA
TGGTGCACGTCACGGGCGTGACCAACTGGTTGGTGTAGGTGCCGTGTGCACAGT
```

So in this case, we have two sequences stored in the file. Note that the line breaks in the second DNA sequence do not count as meaningful for the sequence.

In [16]:
def parse_fasta(filename):
    output = {}
    current_seq_name = '' # Initialize a variable for the current sequence name that we'll update when we see a >
    with open(filename, 'r') as f:
        for line in f:
            line = line.strip()
            if not line: # Blank lines are skipped
                continue
            if line.startswith('>'): # Start of a new sequence
                current_seq_name = line[1:]
                output[current_seq_name] = ''
            else: # Sequence after a name
                output[current_seq_name] += line # Append the current line to the current sequence
    return output

In [15]:
parse_fasta('assets/simple.fasta')

{'dna_sequence_one_line': 'ACTGGTTGGTGTAGGTGCCGTGTGCACACGTGGTGCACGTCACGGGCGTGACCA',
 'dna_sequence_two_lines': 'ACTGGTTGGTGTAGGTGCCGTGTGCGAATTCGGTGCACGTCACGGGCGTGACCATGGTGCACGTCACGGGCGTGACCAACTGGTTGGTGTAGGTGCCGTGTGCACAGT'}

As you can see, this simple function is quite powerful for dealing with a format that we often see in biology!
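
Once the sequences are in a dictionary, ordinary Python takes over. The sketch below uses a hypothetical variant of the parser, `parse_fasta_lines()`, which accepts any iterable of lines rather than a filename. That is an assumption made so the example can run on an in-memory `io.StringIO` object with invented sequences instead of a file on disk.

```python
import io

def parse_fasta_lines(lines):
    """Hypothetical variant of parse_fasta() that takes any iterable of lines."""
    output = {}
    current_seq_name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith('>'):  # start of a new sequence
            current_seq_name = line[1:]
            output[current_seq_name] = ''
        elif current_seq_name is not None:  # sequence line after a name
            output[current_seq_name] += line
    return output

# Invented FASTA text; io.StringIO makes a string behave like an open file.
fasta_text = u'>seq_a\nACGT\n\n>seq_b\nAC\nGT\nTT\n'
sequences = parse_fasta_lines(io.StringIO(fasta_text))

for name in sorted(sequences):
    print(name, len(sequences[name]))
```

Because a file object is itself an iterable of lines, the same function would also accept the `f` from a `with open(...) as f:` block.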

# Exercises!

This has been a simpler chapter than most, so I'll just give a few short exercises.

## 1. Calculate the sum of the numbers in a file

The file 'assets/numbers.txt' contains a list of numbers, one per line. Read this file and calculate the sum of all of the numbers in it. Save the answer in a variable called full_sum.

In [None]:
class TestSum(unittest.TestCase):
    def test_correct_sum(self):
        # Compare floats with a tolerance, since floating-point sums are rarely exact
        self.assertAlmostEqual(full_sum, 10021.42, places=2, msg='Looks like something went wrong!')
    def test_not_string(self):
        self.assertFalse(isinstance(full_sum, str), 'Did you remember to convert each line to a number?')
    
suite = unittest.TestLoader().loadTestsFromTestCase(TestSum)
unittest.TextTestRunner(verbosity=2).run(suite)

## 2. Write lines with an EcoRI cut site from a FASTA file

As we saw in a previous chapter, EcoRI is an enzyme that recognizes the sequence "GAATTC". Here, write a function to iterate through a dictionary with sequence names as the keys and sequences as the values, like the one generated by parse_fasta(). For every sequence that contains an EcoRI cut site, write it to an opened file in FASTA format (one line starting with > followed by the name, then another line containing the sequence).

Your function should take a file as an input, not a filename. So, to call it you would write

```python
with open('assets/output.fasta', 'w') as f:
    write_ecori_only(fasta_data, f)
```

and within the function body you would not use a with/as statement.
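
One reason to accept a file object rather than a filename is that the function will then work with anything that has a .write() method. The sketch below illustrates this with a made-up helper, `write_names()` (not part of the exercise): the same function prints to the screen through `sys.stdout` and writes into an in-memory `io.StringIO` object, which is handy for testing without creating files.

```python
import io
import sys

def write_names(names, output_file):
    """Hypothetical helper: write one name per line to an already-open file."""
    for name in names:
        output_file.write(name + '\n')

# sys.stdout is a file object too, so this prints to the screen.
write_names([u'EcoRI', u'BamHI'], sys.stdout)

# io.StringIO is an in-memory file, useful for checking what was written.
buffer = io.StringIO()
write_names([u'EcoRI', u'BamHI'], buffer)
captured = buffer.getvalue()
```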

In [32]:
def write_ecori_only(input_sequences, output_file):
    EcoRI_cut_site = 'GAATTC'
    for name, sequence in input_sequences.items():
        if EcoRI_cut_site in sequence:
            # Write to the file object that was passed in, not to a global variable
            output_file.write('>' + name + '\n')
            output_file.write(sequence + '\n')

In [40]:
import io

# The backslash after the opening quotes keeps the string from starting with a blank line
sample_output = """\
>dna_sequence_two_lines
ACTGGTTGGTGTAGGTGCCGTGTGCGAATTCGGTGCACGTCACGGGCGTGACCATGGTGCACGTCACGGGCGTGACCAACTGGTTGGTGTAGGTGCCGTGTGCACAGT
"""

class TestEcoRIWriter(unittest.TestCase):
    def test_simple_fasta(self):
        fasta_data = parse_fasta('assets/simple.fasta')
        with io.StringIO() as f:
            write_ecori_only(fasta_data, f)
            self.assertEqual(f.getvalue(), sample_output)

suite = unittest.TestLoader().loadTestsFromTestCase(TestEcoRIWriter)
unittest.TextTestRunner(verbosity=2).run(suite)

In [41]:
fasta_data = parse_fasta('assets/simple.fasta') 
with io.StringIO() as f:
    write_ecori_only(fasta_data, f)
    print(f.getvalue())
 

>dna_sequence_two_lines
ACTGGTTGGTGTAGGTGCCGTGTGCGAATTCGGTGCACGTCACGGGCGTGACCATGGTGCACGTCACGGGCGTGACCAACTGGTTGGTGTAGGTGCCGTGTGCACAGT



In [34]:
fasta_data

{'dna_sequence_one_line': 'ACTGGTTGGTGTAGGTGCCGTGTGCACACGTGGTGCACGTCACGGGCGTGACCA',
 'dna_sequence_two_lines': 'ACTGGTTGGTGTAGGTGCCGTGTGCGAATTCGGTGCACGTCACGGGCGTGACCATGGTGCACGTCACGGGCGTGACCAACTGGTTGGTGTAGGTGCCGTGTGCACAGT'}