# Learning module: Parsing strategies

Parsing is an essential part of bioinformatics. The field uses many different file formats, and there are just as many options for the data structures that store their contents. In this module, you will practice parsing with Python in order to:

1. Parse a text file and store each element in a convenient data structure
2. Use these elements for your analyses
3. Produce properly formatted output text files with your results

To follow the learning module, read and execute the code and instructions for each section in order.

## 1.   Parsing a text file
The first step will be to open and read the file. Then, you will test how different data structures can help you store the data.

### Why can't we just open and read the file?
You can open any text file with the built-in `open()` function. 

Opening a file allows you to:
- read it (with `'r'` passed as the mode argument to `open()`), or
- write to it (with `'w'` passed as the mode argument; note that `'w'` overwrites any existing content)

Either way, opening a file does not import its data into your environment as an object. To do that, you need to read the lines of the text file and store them in an object.
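
The difference between opening a file and reading it can be sketched with a throwaway file. The file name `demo.txt` and its content are made up for illustration and are not part of the assignment files.

```python
import os
import tempfile

# Hypothetical demo file, created in a temporary directory for illustration only
tmp_dir = tempfile.mkdtemp()
demo_path = os.path.join(tmp_dir, 'demo.txt')

# 'w' creates the file (or empties an existing one) and lets you write to it
with open(demo_path, 'w') as f:
    f.write('LOCUS       DEMO_0001\n')

# 'r' opens the file for reading; f is a file object, not the text itself
with open(demo_path, 'r') as f:
    handle_type = type(f).__name__
    text = f.read()   # only now is the content stored in a variable

print(handle_type)    # TextIOWrapper
print(repr(text))     # 'LOCUS       DEMO_0001\n'
```

The file object only gives access to the file; the string returned by `f.read()` is what you can work with afterwards.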

Below is one way to get a string containing the data from a file stored under the `files` directory:


In [8]:
path = 'files/P3_argonaut.gb'
# Open the file in read mode and load its whole content into a single string
with open(path, 'r') as f:
    data = f.read()

A good parsing strategy converts the file's *data format* (how the information is laid out in the file) into suitable *data structures* (the logical units that Python can store and use).

The approach used above (`f.read()`) returns a string that is legible to you but not yet useful for analysis. Run the cell below to see the type of `data` and print a slice of it (the cell prints the first 300 characters by default; you can play with the indices). Then, discuss with your partner(s) why this data structure is not yet useful for your analyses.

In [9]:
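# A bare type(data) is only displayed when it is the last line of a cell, so print it
print(type(data))
print(data[0:300])

<class 'str'>
LOCUS       NM_179453               3507 bp    mRNA    linear   PLN 21-AUG-2009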
DEFINITION  Arabidopsis thaliana AGO1 (ARGONAUTE 1); endoribonuclease/ miRNA
            binding / protein binding / siRNA binding (AGO1) mRNA, complete
            cds.
ACCESSION   NM_179453
VERSION     NM_179453.2  GI:


### Defining the problem
The format of the file you will be working with is *.gb* (GenBank Flat File). Open the [Sample GenBank Record](https://www.ncbi.nlm.nih.gov/genbank/samplerecord/) for a detailed description of each field in a *.gb* file. 

Discuss with your partner:
- Which fields are relevant to the assignment? 
- How are fields structured? 
- Which characters or text structures delimit these fields in the text?

![gb](files/gb.png)

Out of all the fields in a *.gb* record, only these are relevant for the assignment:
1.  Accession number
2.  Organism name
3.  Sequence length
4.  Sequence (used also for %GC calculations)
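
The %GC mentioned in item 4 is the percentage of G and C bases in the sequence. A minimal sketch follows, assuming the sequence is already available as a plain string of bases; `gc_percent` is a hypothetical helper name, and the toy sequence is made up. It also assumes that every character counts towards the length, so ambiguous bases such as `n` would lower the result.

```python
def gc_percent(sequence):
    """Return the percentage of G and C bases in a nucleotide sequence."""
    sequence = sequence.upper()
    if not sequence:
        return 0.0   # avoid dividing by zero on an empty sequence
    gc_count = sequence.count('G') + sequence.count('C')
    return 100 * gc_count / len(sequence)

toy_sequence = 'atgcgcta'          # made-up sequence, not from the assignment file
print(len(toy_sequence))           # 8
print(gc_percent(toy_sequence))    # 50.0
```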

Notice that this sample contains a single GenBank record, while your assignment file contains several records, each with its own accession number.
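
In the GenBank flat file format, each record ends with a line containing only `//`, which makes that line a natural delimiter for a multi-record file. The sketch below is one possible way to split such a file, run on a made-up two-record string rather than the assignment file. `split_records` is a hypothetical helper, and the toy records keep only a few fields.

```python
# Made-up, heavily shortened records for illustration only
toy_text = """LOCUS       DEMO_0001                 12 bp    mRNA    linear   PLN 01-JAN-2000
ACCESSION   DEMO_0001
ORIGIN
        1 atgcatgcat gc
//
LOCUS       DEMO_0002                  8 bp    mRNA    linear   PLN 01-JAN-2000
ACCESSION   DEMO_0002
ORIGIN
        1 ggccttaa
//
"""

def split_records(text):
    """Split GenBank-style text into one string per record."""
    records = []
    current = []
    for line in text.splitlines():
        if line.strip() == '//':          # end-of-record marker
            records.append('\n'.join(current))
            current = []
        else:
            current.append(line)
    return records

records = split_records(toy_text)
print(len(records))   # 2

accessions = []
for record in records:
    for line in record.splitlines():
        # The keyword starts the line; the value follows after whitespace
        if line.startswith('ACCESSION'):
            accessions.append(line.split()[1])
print(accessions)     # ['DEMO_0001', 'DEMO_0002']
```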

#### Selecting a data structure type
Discuss with your partner(s) which of the data structures below best captures the structure of the data in a *.gb* file for the assignment at hand.
- A)    **A list or set** containing a *string* for each entry. Each item is a single string holding all the characters of one GenBank entry.
- B)    **A list of lists or tuples**, where each item in the list is an ordered list or tuple with all relevant data.
- C)    **A dictionary of lists**, where each key corresponds to a list of lists or tuples with all relevant data.
- D)    **A dictionary of dictionaries (nested dictionary)**, where each dictionary key is an entry and each value is a dictionary of the fields of that entry.


![structures](files/structures.png)


For this assignment, each entry has a one-to-many relationship: one accession number maps to several other relevant fields. For this reason, the best approaches are either **C** or **D**, where the accession number (the dictionary key) is paired with its fields (the dictionary value).
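
To make the comparison concrete, here is a sketch of options C and D holding the same two made-up records. It reads option C as an accession number mapped to an ordered list of fields, which is one interpretation of the description above. The key names `organism`, `length` and `sequence` are illustrative choices, not names required by the assignment.

```python
# Made-up records for illustration; values are not from the assignment file

# Option C: accession number -> list with the fields in a fixed order
records_c = {
    'DEMO_0001': ['Arabidopsis thaliana', 12, 'atgcatgcatgc'],
    'DEMO_0002': ['Oryza sativa', 8, 'ggccttaa'],
}

# Option D: accession number -> dictionary with named fields
records_d = {
    'DEMO_0001': {'organism': 'Arabidopsis thaliana', 'length': 12,
                  'sequence': 'atgcatgcatgc'},
    'DEMO_0002': {'organism': 'Oryza sativa', 'length': 8,
                  'sequence': 'ggccttaa'},
}

# With C you need to remember that position 0 holds the organism name
organism_c = records_c['DEMO_0002'][0]
# With D the field name documents itself
organism_d = records_d['DEMO_0002']['organism']
print(organism_c == organism_d)   # True

for accession, fields in records_d.items():
    print(accession, fields['organism'], fields['length'])
```

Both options allow direct lookup by accession number. Option D trades a little extra typing for field names that are easier to read when the script grows.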


#### Selecting a parsing strategy
Now

#### Error handling

### Defining the functions for the script

In [10]:
with open(path,'r') as f:
    data = ''.join([line.strip() for line in f])
print(data)

LOCUS       NM_179453               3507 bp    mRNA    linear   PLN 21-AUG-2009DEFINITION  Arabidopsis thaliana AGO1 (ARGONAUTE 1); endoribonuclease/ miRNAbinding / protein binding / siRNA binding (AGO1) mRNA, completecds.ACCESSION   NM_179453VERSION     NM_179453.2  GI:145361472KEYWORDS    .SOURCE      Arabidopsis thaliana (thale cress)ORGANISM  Arabidopsis thalianaEukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta;Spermatophyta; Magnoliophyta; eudicotyledons; core eudicotyledons;rosids; malvids; Brassicales; Brassicaceae; Camelineae;Arabidopsis.COMMENT     REVIEWED REFSEQ: This record has been curated by TAIR. Thereference sequence was derived from AT1G48410.2.On Apr 18, 2007 this sequence version replaced gi:30694319.FEATURES             Location/Qualifierssource          1..3507/organism="Arabidopsis thaliana"/mol_type="mRNA"/db_xref="taxon:3702"/chromosome="1"/ecotype="Columbia"gene            1..3507/gene="AGO1"/locus_tag="AT1G48410"/gene_synonym="ARGONAUTE 1; T1N1
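
The output above shows a side effect of joining stripped lines with an empty string: the end of one line fuses with the start of the next (for example `2009DEFINITION`), so the line breaks that separate fields are lost. One possible alternative, sketched here on a made-up two-line string rather than the real file, is to keep the lines as items of a list.

```python
# Made-up two-line text for illustration only
toy_text = 'LOCUS       DEMO_0001\nACCESSION   DEMO_0001\n'

# Joining stripped lines with '' removes the boundaries between fields
fused = ''.join(line.strip() for line in toy_text.splitlines())
print(fused)   # LOCUS       DEMO_0001ACCESSION   DEMO_0001

# Keeping a list of lines preserves one field line per item
lines = [line.rstrip() for line in toy_text.splitlines()]
print(lines)   # ['LOCUS       DEMO_0001', 'ACCESSION   DEMO_0001']
```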

A function that 

In [11]:
len(range(0,300))

300