<h1 id="toctitle">Working with files</h1>
<ul id="toc"/>

Files are important in bioinformatics. We have many file formats:

- FASTA
- GenBank
- FASTQ
- VCF
- BLAST output
- SAM

Often, we need to take a file and tweak its format so that existing tools will accept it (e.g. tools that are fussy about FASTA headers).

Other times we need to write a program that will either create input for, or accept output from, other tools.

Today we will talk about:

- reading text
- processing lines in a file
- creating new files
- appending and writing data to files

Later we will talk about:

- renaming
- moving
- copying
- deleting
- doing stuff to each file in a folder

__Important note__: for all the examples and exercises in this (and future) sessions you will need to download and unzip [the course files from here](https://github.com/mojones/eg_ip_2015/archive/gh-pages.zip). 

##Getting data from a file

###Opening a file

Getting data out of a file is a two-step process: open, then read.

In [1]:
my_file = open("dna.txt")

`open()` is a function that takes one string argument - the name of the file - and returns a __File object__.

File objects are a new type of data that represents a file on disk. Like strings, they have useful methods, but unlike strings, printing one shows a description of the object rather than the contents of the file:

In [2]:
print(my_file)

<open file 'dna.txt', mode 'r' at 0x7fe246e364b0>
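A file object also carries a little information about itself. As a minimal sketch, we can ask it for its name and the mode it was opened in. The temporary folder and the throwaway copy of `dna.txt` are assumptions made only so that the snippet runs anywhere; with the course files you would open the supplied `dna.txt` directly.

```python
import os
import tempfile

# Assumption: create a throwaway dna.txt so the example is self-contained
# (writing and closing files are covered later in this session)
path = os.path.join(tempfile.mkdtemp(), "dna.txt")
setup = open(path, "w")
setup.write("ACTGTACGTGCACTGATC\n")
setup.close()

my_file = open(path)
# name holds the path we passed to open(); basename drops the folder part
file_name = os.path.basename(my_file.name)
# mode is "r" (read) because we did not ask for anything else
file_mode = my_file.mode
print(file_name)
print(file_mode)
my_file.close()
```

The mode `r` here matches the `mode 'r'` that appears when the file object is printed.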


###Reading file contents

`read()` is a File object method that returns the contents of the file as a single string. Here we call it without any arguments.

In [3]:
my_file = open("dna.txt")
print(my_file.read())

ACTGTACGTGCACTGATC



Remember the special character `\n`. Every line in the file ends with this newline character, so the string we get back from `read()` ends with one too. Remove it with the `rstrip()` method:

In [4]:
my_file = open("dna.txt")
my_file_contents = my_file.read()
# remove the newline from the end of the file contents
my_dna = my_file_contents.rstrip("\n")
print(my_dna)

ACTGTACGTGCACTGATC


Notice how this version doesn't have the extra empty line.
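To see exactly what `rstrip()` changes, and how the same idea applies when processing a file one line at a time, here is a hedged sketch. It first writes its own small two-line file (the file name `two_lines.txt`, its contents and the temporary folder are made up for illustration), then compares lengths before and after stripping, and finally loops over the lines.

```python
import os
import tempfile

# Assumption: a made-up two-line file, created here so the sketch runs anywhere
path = os.path.join(tempfile.mkdtemp(), "two_lines.txt")
setup = open(path, "w")
setup.write("ACTGTACGTGCACTGATC\nTTGACA\n")
setup.close()

# read() gives one string, with the newline characters included
my_file = open(path)
contents = my_file.read()
my_file.close()
print(len(contents))               # 26: 18 + 6 bases plus 2 newlines
print(len(contents.rstrip("\n")))  # 25: only the trailing newline is removed

# looping over a file object gives one line at a time
lengths = []
my_file = open(path)
for line in my_file:
    dna = line.rstrip("\n")  # strip the newline from this line only
    lengths.append(len(dna))
my_file.close()
print(lengths)  # [18, 6]
```

Note that `rstrip("\n")` on the whole contents only removes the newline at the very end; the newline between the two lines is still there, which is why stripping each line inside the loop is usually what we want.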

##Writing to files

To write to a file we have to use a second argument to open:

In [5]:
my_file = open("out.txt", "w")

`w` stands for write. Once we have opened a file for writing, we can use the `write()` method:

In [6]:
my_file.write("Hello world")

How can we tell if this has worked? We need to open the file in a text editor (IDLE will do fine). 

##Closing files

Once we've finished writing data to a file, we have to close it:

In [7]:
my_file = open("out.txt", "w")
my_file.write("Hello world")
# remember to close the file
my_file.close()
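Instead of opening a text editor, we can also check the result from Python by reading the file back once it has been closed. The sketch below additionally shows append mode (`"a"`), which adds text to the end of an existing file rather than replacing its contents. The temporary folder is an assumption so that the example does not overwrite any of your own files.

```python
import os
import tempfile

# Assumption: write into a temporary folder rather than the current one
path = os.path.join(tempfile.mkdtemp(), "out.txt")

# "w" creates the file, or empties it if it already exists
my_file = open(path, "w")
my_file.write("Hello world\n")
my_file.close()

# "a" keeps the existing contents and writes at the end
my_file = open(path, "a")
my_file.write("Hello again\n")
my_file.close()

# open for reading to check what ended up in the file
my_file = open(path)
contents = my_file.read()
my_file.close()
print(contents)
```

`write()` does not add a newline for us, so each string here ends with `\n` to keep the two pieces of text on separate lines.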

##Summary of all things!

|  __Name__ | __Job__  | __Argument__  | __Returns__  | __Type__  |
|---|---|---|---|---|
| `open()`  | opens a file for reading or writing  | filename, optional mode (both strings)  | File object  | built in function |
|  `read()` | reads the contents of a file  | none  | String  | method of File objects  |
| `rstrip()` | removes characters from end of string (usually newline)| string to remove  | string  | method of string objects |
| `write()`  | writes text to a file | string to write | nothing  | method of File objects |
|   `close()`| closes a file | none | nothing | method of File objects|



##Exercises

You'll need to use the string manipulation material from the previous session, so have it open somewhere.

###Splitting genomic DNA

Look at the file called _genomic_dna.txt_ – it contains the same piece of genomic DNA that we were using in the final exercise from the previous session. The first exon runs from the start of the sequence to the sixty-third character, and the second exon runs from the ninety-first character to the end of the sequence. 

Write a program that will split the genomic DNA into coding and non-coding parts, and write these sequences to two separate files. Hint: use your solution to the last exercise from the previous session as a starting point.

###Writing a FASTA file

A FASTA file stores sequence data and looks like this:

```
>sequence_one
cgatcgatcatcgatgcattgtagctatcg
>sequence_two
acagtagctacgtgtgtcgta
```

Write a program that will create a FASTA file for the following three sequences – make sure that all sequences are in upper case and only contain the bases A, T, G and C.

| __Sequence header__ | __Sequence__ |
|---------------------|---------------|
| ABC123 | ATCGTACGATCGATCGATCGCTAGACGTATCG |
| DEF456 | actgatcgacgatcgatcgatcacgact |
| HIJ789 | ACTGAC-ACTGT--ACTGTA----CATGTG |

###Writing multiple FASTA files

Use the data from the previous exercise, but instead of creating a __single__ FASTA file, create __three__ new FASTA files – one per sequence. The names of the FASTA files should be the same as the sequence header names, with the extension .fasta.


