# An introduction to solving biological problems with Python

## Session 2.3: Files

- [Using files](#Using-files)
- [Data formats](#Data-formats)
- [Importing modules and libraries](#Importing-modules-and-libraries)
- [Using the `csv` module](#Using-the-csv-module)
- [Python file library](#Python-file-library)

# Data input and output (I/O)

So far, all the data we have been working with has been written by us into our scripts, and the results of our computations have just been displayed in the terminal output. In the real world, data will be supplied by the user of our programs (who may be you!) by some means, and we will often want to save the results of some analysis somewhere more permanent than just printing them to the screen. In this session we cover two widely used ways of reading data into our programs: via the command line and by reading files from disk. We also discuss writing data out to files.

There are, of course, many other ways of accessing data, such as querying a database or retrieving data from a network such as the internet. We don't cover these here, but python has excellent support for interacting with databases and networks either in the standard library or using external modules.

## Using files

Frequently the data we want to operate on or analyse will be stored in files, so in our programs we need to be able to open files, read through them (perhaps all at once, perhaps not), and then close them. 

We will also frequently want to be able to print output to files rather than always printing out results to the terminal.

Python supports all of these modes of operations on files, and provides a number of useful functions and syntax to make dealing with files straightforward.

## File objects

To open a file, python provides the `open` function, which takes a filename as its first argument and returns a _file object_ which is python's internal representation of the file.

In [None]:
path = "data/datafile.txt"
fileObj = open( path )

`open` takes an optional second argument specifying the _mode_ in which the file is opened, either for reading, writing or appending.

In [None]:
open( "data/myfile.txt", "r" ) # open for reading, default

In [None]:
open( "data/myfile.txt", "w" ) # open for writing (existing files will be overwritten)

In [None]:
open( "data/myfile.txt", "a" ) # open for appending

__Mode modifiers__

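These mode strings can include an extra modifier character that deals with differences between platforms and file types.

`b`: binary mode, e.g. `"rb"`. The data is read and written as raw bytes, with no translation of end-of-line characters to the platform-specific value.

`U`: universal newline mode, e.g. `"rU"`. This presented end-of-line as `"\n"` no matter where the file was written. In Python 3 this is already the default behaviour when reading in text mode, and the `U` modifier was removed in Python 3.11, so it should not be used in new code.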
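
As a minimal sketch of the difference between text and binary mode (assuming Python 3), the cell below writes a small file with Windows-style line endings and reads it back both ways. The file name `demo.txt` and the temporary directory are assumptions made for this illustration only.

```python
import os
import tempfile

# Hypothetical demo file, created in a temporary directory so nothing is overwritten
demo_path = os.path.join(tempfile.mkdtemp(), "demo.txt")

# Write raw bytes so we control the exact line endings (Windows-style "\r\n")
with open(demo_path, "wb") as handle:
    handle.write(b"ACGT\r\nTTGA\r\n")

# Text mode (the default): line endings are translated to "\n" and we get str objects
with open(demo_path, "r") as handle:
    text_lines = handle.readlines()

# Binary mode: no translation, and we get bytes objects
with open(demo_path, "rb") as handle:
    raw_lines = handle.readlines()

print(text_lines)  # ['ACGT\n', 'TTGA\n']
print(raw_lines)   # [b'ACGT\r\n', b'TTGA\r\n']
```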

## Closing files

To close a file once you have finished with it, you can call the `.close` method on the file object.

In [None]:
fileObj.close()

## Reading from files

Once we have opened a file for reading, file objects provide a number of methods for accessing the data in a file. The simplest of these is the `.read` method that reads the entire contents of the file into a string variable.



In [None]:
fileObj = open( "data/datafile.txt" )
print(fileObj.read()) # everything
fileObj.close()

Note that this means the entire file will be read into memory. If you are operating on a large file and don't actually need all the data at the same time, this is rather inefficient.

Frequently, we just need to operate on individual lines of the file, and you can use the `.readline` method to read a line from a file and return it as a python string.

File objects internally keep track of your current location in a file, so to get following lines from the file you can call this method multiple times.

It is important to note that the string representing each line will have a trailing newline `"\n"` character, which you may want to remove with the `.rstrip` string method.

Once the end of the file is reached, `.readline` will return an empty string `''`. This is different from an apparently empty line in a file, as even an empty line will contain a newline character. Recall that the empty string is considered as `False` in python, so you can readily check for this condition with an `if` statement etc.
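
The end-of-file check described above can be sketched as a `while` loop. This is only an illustration: the file `demo.txt` is a hypothetical file created in a temporary directory, and it deliberately contains an empty line to show that an empty line does not stop the loop.

```python
import os
import tempfile

# Hypothetical demo file with an empty line in the middle
demo_path = os.path.join(tempfile.mkdtemp(), "demo.txt")
with open(demo_path, "w") as handle:
    handle.write("first\n\nthird\n")

seen = []
fileObj = open(demo_path)
line = fileObj.readline()
while line:  # '' (end of file) counts as False, '\n' (an empty line) counts as True
    seen.append(line.rstrip("\n"))
    line = fileObj.readline()
fileObj.close()

print(seen)  # ['first', '', 'third']
```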

In [None]:
# one line at a time
fileObj = open( "data/datafile.txt" )
print("1st line: %r" % fileObj.readline())
print("2nd line: %r" % fileObj.readline())
print("3rd line: %r" % fileObj.readline())
print("4th line: %r" % fileObj.readline())
fileObj.close()

To read in all lines from a file as a list of strings containing the data from each line, use the `.readlines` method (though note that this will again read all data into memory).

In [None]:
# all lines
fileObj = open( "data/datafile.txt" )

lines = fileObj.readlines()

print("The file has", len(lines), "lines")

fileObj.close()

Looping over the lines in a file is a very common operation, and python lets you iterate over a file using a `for` loop just as if it were a list of strings. This does not read all data into memory at once, and so is much more efficient than reading the file with `.readlines` and then looping over the resulting list.

In [None]:
# as an iterable
fileObj = open( "data/datafile.txt" )

for line in fileObj:
    print(line.rstrip().upper())

fileObj.close()

### The with statement

It is important that files are closed when they are no longer required, but writing ``fileObj.close()`` is tedious (and more importantly, easy to forget). An alternative syntax is to open the files within a ``with`` statement, in which case the file will automatically be closed at the end of the `with` block.
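
One way to convince yourself of this is to inspect the `.closed` attribute of the file object. The sketch below uses a hypothetical `demo.txt` in a temporary directory and also shows that the file is closed when an error interrupts the block.

```python
import os
import tempfile

# Hypothetical demo file created just for this illustration
demo_path = os.path.join(tempfile.mkdtemp(), "demo.txt")
with open(demo_path, "w") as handle:
    handle.write("ACGT\n")

with open(demo_path) as fileObj:
    inside = fileObj.closed       # False: the file is still open inside the block
    first_line = fileObj.readline()

outside = fileObj.closed          # True: closed automatically on leaving the block

try:
    with open(demo_path) as fileObj:
        int(fileObj.readline())   # raises ValueError, "ACGT" is not a number
except ValueError:
    pass

closed_after_error = fileObj.closed  # True: closed even though the block failed

print(inside, outside, closed_after_error)
```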

In [None]:
# fileObj will be closed when leaving the block
with open( "data/datafile.txt" ) as fileObj:
    for ( i, line ) in enumerate( fileObj, start = 1 ):
        print("%s: %r" % ( i, line ))
        

## Writing to files

Once a file has been opened for writing, you can use the `.write` method on a file object to write data to the file.

The argument to the `.write` method must be a string, so if you want to write out numerical data to a file you will have to convert it to a string somehow beforehand.

Remember to include a newline character to separate lines of your output: unlike the `print` function, `.write` does not add one by default.
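
As a small sketch of what this means in practice, the cell below tries to write a number directly, then writes it after converting it with `str` and with a format string. The output file `numbers.txt` and the variable `gc_content` are made up for this example.

```python
import os
import tempfile

out_path = os.path.join(tempfile.mkdtemp(), "numbers.txt")  # hypothetical output file
gc_content = 0.4173                                          # made-up value

with open(out_path, "w") as output:
    try:
        output.write(gc_content)           # not a string, so this fails
    except TypeError as err:
        print("write failed:", err)
    output.write(str(gc_content) + "\n")   # explicit conversion, newline added by hand
    output.write("%.2f\n" % gc_content)    # format string, two decimal places

with open(out_path) as check:
    written = check.read()

print(repr(written))  # '0.4173\n0.42\n'
```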

In [None]:
read_counts = {
    'BRCA2': 43234,
    'FOXP2': 3245,
    'SORT1': 343792
}

with open( "out.txt", "w" ) as output:
    output.write("GENE\tREAD_COUNT\n")

    for gene in read_counts:
        line = "\t".join( [ gene, str(read_counts[gene]) ] )
        output.write(line+"\n")


To view the output file, open a terminal window, go to the directory where the file has been written, and print the content of the file using the `cat` command, or open it using your favourite editor:

```bash
cat out.txt
```

Be cautious when opening a file for writing, as python will happily let you overwrite any existing data in the file. 
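
If overwriting would be a problem, one option (a sketch, not the only approach) is the `"x"` mode available in Python 3, which creates a new file and raises `FileExistsError` if the file already exists. The file name `results.txt` is hypothetical.

```python
import os
import tempfile

out_path = os.path.join(tempfile.mkdtemp(), "results.txt")  # hypothetical output file

# Create the file for the first time
with open(out_path, "w") as output:
    output.write("original data\n")

# "x" refuses to open a file that already exists
try:
    with open(out_path, "x") as output:
        output.write("replacement data\n")
    overwritten = True
except FileExistsError:
    overwritten = False
    print(out_path, "already exists, not overwriting")

with open(out_path) as check:
    content = check.read()

print(repr(content))  # 'original data\n'
```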

### [4.2] Exercises

1. Write a script that writes the values of a list of numbers to a file, with each number on a separate line.

2. Write a script that takes the name of a file containing many lines of nucleotide sequence as a command line argument and opens the file for reading (checking that the filename supplied does exist). For each line in the file, print out the line number and the length of the corresponding line (There is an example file <a href="http://www.ebi.ac.uk/~grsr/perl/dna.txt">here</a> or in `data/dna.txt` from the course materials ).


## Data formats

Bioinformaticians love creating endless new file formats for their data, but there are a number of very common standard formats that it is good to get used to parsing.

Delimited:

### Reading delimited files

We can use the various string manipulation techniques covered earlier to process delimited files in a fairly straightforward way. Here we loop through a file with columns delimited by spaces, reading the data for each row into a list, and storing each of these lists into a main results list.

In [None]:
%%bash
cat data/mydata.txt

In [None]:
results = []

with open("data/mydata.txt", "r") as data:
    header = data.readline()
    for line in data:
        line = line.strip()
        results.append(line.split(" "))
        
print(results)

Here we show a slightly more complicated example where we are reading the results into a more convenient data structure, a list of dictionaries with the dictionary keys corresponding to the column headers and the values to the values from each line. We also convert the columns to an appropriate type as we go.

In [None]:
results = []

with open("data/mydata.txt", "r") as data:
    header = data.readline()
    for line in data:
        idx, org, score = line.strip().split(" ")
        row = {'index': int(idx), 'organism': org, 'score': float(score)}
        results.append(row)
        
print(results)
print('Score of first row:', results[0]['score'])

### Writing delimited files

Writing out a delimited file is also straightforward using the `join` method, or possibly using format strings. Here, as an example we will recreate our original file from above, but this time we will delimit the columns with a comma.

In [None]:
with open('data/mydata.csv', 'w') as output:
    # write a header, using the keys from the first dictionary
    header = ",".join(list(results[0].keys()))
    output.write(header + "\n")
    for row in results:
        vals = [str(v) for v in list(row.values())]
        row_line = ",".join(vals)
        output.write(row_line + "\n")

In [None]:
%%bash
cat data/mydata.csv

Note that there is actually a module in the standard library called `csv` which can also be used to read and write delimited files. There is some example code reading this same file using this library towards the end of this notebook. 

## Importing modules and libraries

Like other languages, Python has the ability to import external modules (or libraries) into the current program. These modules may be part of the standard library that is automatically included with the Python installation, they may be extra libraries which you install separately, or they may be other Python programs you have written yourself. Whatever the source of a module, it is imported into a program via an <tt>import</tt> statement.

For example, if we wish to access the mathematical constants pi and e we can use the import keyword to get the module named <tt>math</tt> and access its contents with the dot notation:

In [None]:
import math
print(math.pi, math.e)

Also we can use the `as` keyword to give the module a different name in our code, which can be useful for brevity and avoiding name conflicts:

In [None]:
import math as m
print(m.pi, m.e)

Alternatively we can import the separate components using the `from … import` keyword combination:

In [None]:
from math import pi, e
print(pi, e)

We can import multiple components from a single module, either on one line:

In [None]:
from sys import argv, exit

Or on separate lines:

In [None]:
from sys import argv 
from sys import exit

### Listing module contents

You can list the contents of a module by passing it to the built-in `dir()` function:

In [None]:
import math
dir(math)

or by passing an instance directly, such as this string:

In [None]:
dir("mystring")

In [None]:
# or using the object type
dir(str)

### Getting quick help on a method

After listing all contents, you may wish to display specific information on a method using `help()`

In [2]:
help(str.title)

Help on method_descriptor:

title(...)
    S.title() -> string
    
    Return a titlecased version of S, i.e. words start with uppercase
    characters, all remaining cased characters have lowercase.



The most complete information is online on the https://www.python.org/ website, where the documentation for the
[Python 3 Standard Library](https://docs.python.org/3/library/index.html) can be found.

## Using the `csv` module

In [None]:
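import csv
# In Python 3 the csv module expects a text-mode file; newline="" lets it handle line endings itself
with open( "data/mydata.txt", "r", newline="" ) as f:
    reader = csv.reader( f, delimiter = " " ) # delimiter defaults to ","
    print(list( reader ))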

In [None]:
# Read from a list of lines (csv.reader accepts any iterable of strings)
import csv

with open( "data/mydata.txt", "r" ) as f:
    data = f.readlines()

reader = csv.reader( data, delimiter = " " )
print(list( reader ))

In [None]:
# Read in as dictionary
results = []
with open( "data/mydata.txt", "r", newline="" ) as fileObj:
    reader = csv.DictReader( fileObj, delimiter = " " ) # the header line supplies the keys, so do not remove it
    results.extend(list( reader ))
    
print(results)

In [None]:
# Write delimited files using the csv module from a list of list
import csv

mydata = [
    ['1', 'Human', '1.076'], 
    ['2', 'Mouse', '1.202'], 
    ['3', 'Frog', '2.2362'], 
    ['4', 'Fly', '0.9853']
]

# Open in text mode; newline="" stops extra blank lines appearing on some platforms
with open( "csvdata.csv", "w", newline="" ) as fileObj:
    writer = csv.writer( fileObj, delimiter='\t' )
    writer.writerow( [ "Index", "Organism", "Score" ] ) # write header

    for record in mydata:
        writer.writerow( record )

with open( "csvdata.csv", "r" ) as f:
    print(f.read())


In [None]:
# Write delimited files using the csv module from a list of dictionaries 
import csv

mydata = [
    {'Index': '1', 'Score': '1.076', 'Organism': 'Human'}, 
    {'Index': '2', 'Score': '1.202', 'Organism': 'Mouse'}, 
    {'Index': '3', 'Score': '2.2362', 'Organism': 'Frog'}, 
    {'Index': '4', 'Score': '0.9853', 'Organism': 'Fly'}
]

fieldnames = ['Index', 'Organism', 'Score']

# Open in text mode; newline="" stops extra blank lines appearing on some platforms
with open( "csvdictdata.csv", "w", newline="" ) as fileObj:
    writer = csv.DictWriter( fileObj, fieldnames, delimiter='\t' )
    writer.writeheader() # write header

    for record in mydata:
        writer.writerow( record )

with open( "csvdictdata.csv", "r" ) as f:
    print(f.read())


## Python file library

`os`:

- `chdir(path)` : change the current working directory to be path
- `getcwd()` : return the current working directory
- `listdir(path)` : returns a list of files/directories in the directory path
- `mkdir(path)` : create the directory path
- `rmdir(path)` : remove the directory path
- `remove(path)` : remove the file path
- `rename(src, dst)` : move the file/directory from src to dst
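
A short sketch of how these `os` functions fit together is shown below. It works in a temporary scratch directory so that no course files are touched. The directory name `results` and the file names are made up for this illustration.

```python
import os
import tempfile

base = tempfile.mkdtemp()   # scratch directory for the demo
start_dir = os.getcwd()     # remember where we started

os.chdir(base)
os.mkdir("results")         # create a sub-directory
with open("results/counts.txt", "w") as output:
    output.write("BRCA2\t43234\n")

os.rename("results/counts.txt", "results/read_counts.txt")
listing = os.listdir("results")
print(listing)              # ['read_counts.txt']

os.remove("results/read_counts.txt")  # rmdir only works on empty directories
os.rmdir("results")
remaining = os.listdir(".")

os.chdir(start_dir)         # go back so later cells are unaffected
os.rmdir(base)
```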

`os.path`:

- `exists(path)` : returns whether path exists
- `isfile(path)` : returns whether path is a “regular” file (as opposed to a directory)
- `isdir(path)` : returns whether path is a directory
- `islink(path)` : returns whether path is a symbolic link
- `join(*paths)` : joins the paths together into one long path
- `dirname(path)` : returns directory containing the path
- `basename(path)` : returns the path minus the dirname(path) in front
- `split(path)` : returns (dirname(path), basename(path))
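
The `os.path` functions listed above can be tried out on a small example. This is only a sketch: the file `dna.txt` is created in a temporary directory for the purpose, and `missing.txt` is a name assumed not to exist.

```python
import os
import tempfile

scratch = tempfile.mkdtemp()              # temporary directory for the demo
path = os.path.join(scratch, "dna.txt")   # hypothetical data file
with open(path, "w") as output:
    output.write("ACGT\n")

directory, filename = os.path.split(path)
print(filename)                               # dna.txt
print(directory == os.path.dirname(path))     # True
print(filename == os.path.basename(path))     # True

is_file = os.path.isfile(path)                # True: a regular file
is_dir = os.path.isdir(directory)             # True: the containing directory
missing = os.path.exists(os.path.join(scratch, "missing.txt"))  # False

print(is_file, is_dir, missing)
```

Checking `os.path.exists` or `os.path.isfile` before calling `open` is a simple way to give the user a clear message when a file name is wrong.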

In [None]:
import os.path
os.path.join( "home", "test", "mydoc.txt" )
# home/test/mydoc.txt - Unix
# home\test\mydoc.txt - Windows

__[4.3] Exercises__

1. Write a program that reads in a tab delimited file with 4 columns: gene, chromosome, start and end coordinates. Compute the length of each gene and print the name of each gene and its corresponding length, separated by a space, to a new file. You can find an example file <a href="http://www.ebi.ac.uk/~grsr/perl/genes.txt">here</a>, or in `data/genes.txt` in the course materials.

2. Write a program that extends the search_gzip_file.py script in the scripts directory to use a file containing sample accession numbers, and writes out a csv file containing each accession number and a boolean value indicating whether it is in the *1000genomes* data or not. **Hint**: preprocess the *1000genomes* data into a data structure that allows quick membership tests.

In [None]:
%%bash 
cat scripts/search_gzip_file.py

In [None]:
%%bash
python scripts/search_gzip_file.py SRS006837

### TODO: Bonus exercise

- use pandas to parse this file