# Common biological data files and how to parse them

One very common task for biologists and other scientists is to extract data from large data files or to convert between file formats. This page describes several of the file types that biologists use most often, with sample code showing how each might be parsed (parsing means reading in the file and separating out the information you require). Remember that each line ends with an 'invisible' newline character (`\n`) that often needs to be removed when processing the line.
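
To see the newline character for yourself, the short sketch below uses `repr()` on a made-up line from a tab-separated file (a format covered later on this page). It also suggests why `rstrip("\n")` is usually a safer choice than a bare `strip()`: the latter removes all surrounding whitespace, which here includes the tab that marks an empty final field.

```python
# A made-up line as it might be read from a tab-separated file.
# The last field is empty, so the line ends with a tab and then a newline.
line = "Homo sapiens\tEG010293\t447\t\n"

print(repr(line))  # repr() makes the 'invisible' characters visible

# rstrip("\n") removes only the newline, so the empty last field survives
fields = line.rstrip("\n").split("\t")

# strip() with no argument also removes the trailing tab, losing that field
fields_stripped = line.strip().split("\t")

print(len(fields), len(fields_stripped))
```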

Most of the examples use the following table, to show how the same data look in each text file format.

|Species|gene_id|sequence_length_bp|percent_gc|oligonucleotide_primer|
|------|------|------|------|------|
|Homo sapiens|EG010293|447|48|AGTAGGTTAGTTAGGT|
|Homo sapiens|EG013928|684|59|AGTACCAGGATGACCA|
|Mus musculus|QH091219|714|47|CATGATGACCAGTAGA|
|Mus musculus|TY492318|631|55|GTAGTGGATTCCATGG|

### Comma-separated values (csv)

CSV files are one of the most common file types used to hold biological data, partly because they are easily imported into and exported from Microsoft Excel and other spreadsheet software.

CSV files contain rows of data where each element in the row is separated by a comma (,). 

```
Species,gene_id,sequence_length_bp,percent_gc,oligonucleotide_primer
Homo sapiens,EG010293,447,48,AGTAGGTTAGTTAGGT
Homo sapiens,EG013928,684,59,AGTACCAGGATGACCA
Mus musculus,QH091219,714,47,CATGATGACCAGTAGA
Mus musculus,TY492318,631,55,GTAGTGGATTCCATGG
```

#### Example code for parsing csv files

```python
with open("csv_file.csv", "r") as f:
    header = next(f)  # skip the first line of the file, which contains the column headers
    for line in f:
        # remove the end-of-line character, then split at each comma (produces a list)
        data = line.rstrip("\n").split(",")
        # print the sequence length field for each line: the 3rd element of the list (data[2], as lists are zero-indexed)
        print("Sequence Length is {}bp".format(data[2]))
```
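
Splitting at every comma breaks down when a field itself contains a comma (spreadsheet software wraps such fields in quotation marks). As an alternative, here is a minimal sketch using the `csv` module from Python's standard library, which copes with quoted fields. The function name `read_lengths` and the in-memory `csv_text` are made up for illustration; with a real file you would pass in the file object returned by `open()` instead.

```python
import csv
import io

# Stand-in for an open file, holding two rows of the example table
csv_text = """Species,gene_id,sequence_length_bp,percent_gc,oligonucleotide_primer
Homo sapiens,EG010293,447,48,AGTAGGTTAGTTAGGT
Mus musculus,QH091219,714,47,CATGATGACCAGTAGA
"""

def read_lengths(handle):
    """Return a dictionary mapping each gene_id to its sequence length (an int)."""
    reader = csv.DictReader(handle)  # uses the first row as the column names
    lengths = {}
    for row in reader:  # each row is a dictionary keyed by column name
        lengths[row["gene_id"]] = int(row["sequence_length_bp"])
    return lengths

lengths = read_lengths(io.StringIO(csv_text))
for gene_id in lengths:
    print("{} is {}bp long".format(gene_id, lengths[gene_id]))
```

When opening a real file for the `csv` module, the Python documentation recommends `open(filename, newline="")`.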

### Tab-separated values

Many variants of the csv format are in use. Tab-separated values files are probably the most common, followed by space-delimited text files. The structure of these files is the same as that of csv files, except that the delimiting character (the character that separates the values on each line of the file) is a tab in tab-separated values files and a space in space-delimited files. Tabs are represented using the escape sequence `\t`.

Parsing these files is as simple as changing the delimiter passed to the `split()` method. In the csv example above, the line that calls `split()` becomes:

```python
data = line.rstrip("\n").split("\t")
```
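
Because only the delimiter differs, converting between these formats comes down to splitting on one character and joining on another. The sketch below shows one way this might look; `convert_delimiter` is a hypothetical helper written for this page, and it assumes that no field contains the delimiter itself.

```python
def convert_delimiter(lines, old=",", new="\t"):
    """Return the lines with their fields re-joined using a new delimiter."""
    converted = []
    for line in lines:
        fields = line.rstrip("\n").split(old)  # split on the old delimiter
        converted.append(new.join(fields))     # join on the new delimiter
    return converted

# Two lines of the example table, as they would be read from a csv file
csv_lines = [
    "Species,gene_id,sequence_length_bp\n",
    "Homo sapiens,EG010293,447\n",
]

tsv_lines = convert_delimiter(csv_lines)
for line in tsv_lines:
    print(line)
```

Note that the example table could not be stored safely as a space-delimited file, because the species names themselves contain a space: `"Homo sapiens".split(" ")` gives two fields rather than one.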

### FASTA files

FASTA files are a very common file type for storing DNA/RNA or protein sequence information. Each record starts with a line containing a sequence identifier that is prefixed with the 'greater than' (>) character. The following line, or lines, contain the DNA/RNA/protein sequence. FASTA files can contain a single sequence or many, even up to millions of sequences.

There are multiple different conventions for the FASTA sequence identifier line: sometimes semi-colon or space-delimited fields are present. It is best to look up the documentation for the software that created your FASTA file to check which schema it uses. 
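
As an illustration of one such convention, the sketch below assumes that the identifier is everything up to the first space and that the rest of the line is a free-text description. The function `split_header` and the example header are made up for this page; your own files may follow a different scheme.

```python
def split_header(header_line):
    """Split a FASTA header line into (identifier, description).

    Assumes the identifier ends at the first whitespace character.
    """
    text = header_line.rstrip("\n").lstrip(">")  # remove newline and > prefix
    parts = text.split(None, 1)  # split once, at the first run of whitespace
    seq_id = parts[0] if parts else ""
    description = parts[1] if len(parts) > 1 else ""
    return seq_id, description

# A hypothetical header line with a space-delimited description
seq_id, description = split_header(">seq1 Homo sapiens hypothetical gene\n")
print(seq_id)
print(description)
```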

The box below shows a typical FASTA file containing three sequences:

```
>seq1
ATCTACGTAGCTAGTCAGCGCTAACGATCGGCTAGCTACGTCTAGCGATCGCGCTATCTACTGCATC
>seq2
CATCGATCGATCTACGTAGCTACGCGCTAGCTATCTACTAGCTCTACGCTACGTCACTATCGTCGA
>seq3
GATGCTAGAGATAGACGCTCCACTACGTCCTACTAGCTACGATCGTACTCTCTATCCGTCTCTCTCG
```

When parsing FASTA files in Python, it is important to allow for sequences that are split across multiple lines (i.e. contain internal line breaks). The following two code examples extract the sequence identifier and sequence from FASTA format files. Both work through the file line by line, so the file itself never needs to be read into the computer's memory all at once. The first stores every sequence in a dictionary as it goes, whereas the second holds only one sequence at a time, which makes it better suited to files containing very large numbers of sequences.

The best way to parse FASTA files is often to use [Biopython](http://biopython.org/wiki/Biopython).

The Python scripts below will also process FASTA files quite efficiently, though they use some Python features that you may not be familiar with yet. Both define their own functions, which means you can reuse the code whenever you need to parse a FASTA file. The second script uses a 'generator', an advanced Python programming concept not included in this course.


```python
def parse_fasta(f):
    """Read FASTA records from an open file; return a dictionary of ids and sequences."""
    sequences = {}  # dictionary to store ids (keys) and sequences (values)
    seq_id = None

    for line in f:
        line = line.rstrip('\n')  # remove newline

        if line.startswith('>'):  # if it is an id line
            seq_id = line[1:]  # remove > character
            sequences[seq_id] = ''  # enter id in dictionary with blank sequence

        elif seq_id is not None:
            # not an id line: append to the sequence stored under the previous id,
            # dropping any trailing '*' (sometimes used to mark the end of a protein sequence)
            sequences[seq_id] += line.rstrip('*')

    return sequences

with open('illumina_reads.fasta') as f:
    seqs = parse_fasta(f)  # this returns a dictionary of ids and sequences
    for key in seqs:  # loop through keys and print all key value pairs
        print(key, seqs[key])
```

```
seq1 ATCTACGTAGCTAGTCAGCGCTAACGATCGGCTAGCTACGTCTAGCGATCGCGCTATCTACTGCATC
seq2 CATCGATCGATCTACGTAGCTACGCGCTAGCTATCTACTAGCTCTACGCTACGTCACTATCGTCGA
seq3 GATGCTAGAGATAGACGCTCCACTACGTCCTACTAGCTACGATCGTACTCTCTATCCGTCTCTCTCG
```


```python
# Define FASTA parsing function
# This example uses a generator, a special type of function that we don't cover in this course.
# It is the most suitable code if you have really large files, as it minimises the amount of data held in memory.
def parse_fasta_alt(f):
    seq_id, seq = None, []

    for line in f:
        line = line.rstrip("\n")  # remove newline character

        if line.startswith(">"):  # if line is a FASTA header line
            if seq_id:
                yield (seq_id, ''.join(seq))  # yield the previous id and sequence
            seq_id, seq = line, []  # set id to contents of line (including >) and start a blank sequence

        else:
            seq.append(line)  # if sequence line, append contents to sequence

    if seq_id:
        yield (seq_id, ''.join(seq))  # yield the final id and sequence

# Read file and output each id, sequence and sequence length
with open('illumina_reads.fasta') as f:
    for seq_id, seq in parse_fasta_alt(f):
        print(seq_id, seq)
        print("Sequence Length is {}".format(len(seq)))
```

```
>seq1 ATCTACGTAGCTAGTCAGCGCTAACGATCGGCTAGCTACGTCTAGCGATCGCGCTATCTACTGCATC
Sequence Length is 67
>seq2 CATCGATCGATCTACGTAGCTACGCGCTAGCTATCTACTAGCTCTACGCTACGTCACTATCGTCGA
Sequence Length is 66
>seq3 GATGCTAGAGATAGACGCTCCACTACGTCCTACTAGCTACGATCGTACTCTCTATCCGTCTCTCTCG
Sequence Length is 67
```
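
Once sequences are available as (identifier, sequence) pairs, such as those yielded by `parse_fasta_alt`, converting them to another format takes only a few lines. The sketch below writes one tab-separated line per sequence. The function `records_to_tsv`, its column names and the two short records are made up for illustration, and `io.StringIO` stands in for a file opened for writing.

```python
import io

def records_to_tsv(records, out):
    """Write a header line, then one tab-separated line per (id, sequence) pair."""
    out.write("seq_id\tsequence_length_bp\tsequence\n")
    for seq_id, seq in records:
        seq_id = seq_id.lstrip(">")  # drop the > prefix if it is still attached
        out.write("{}\t{}\t{}\n".format(seq_id, len(seq), seq))

# Two short made-up records in the form yielded by the generator
records = [(">seq1", "ATCTACG"), (">seq2", "CATCG")]

out = io.StringIO()  # behaves like a file opened with open(filename, "w")
records_to_tsv(records, out)
tsv_text = out.getvalue()
print(tsv_text, end="")
```

Because `records_to_tsv` accepts any sequence of pairs, a generator can be passed to it directly, so only one sequence is held in memory at a time.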
