# Learning module: Parsing strategies

Parsing is an essential part of bioinformatics. The field uses many different file formats, and there are many options for the data structures that store their data. In this module, you will practice your parsing skills with Python to:

1. [**Parse a text file and store each element in a convenient data structure**](#1-parse-a-text-file-and-store-each-element-in-a-convenient-data-structure)
2. **Use these elements for your analyses**
3. **Produce properly formatted output text files with your results**

To follow the learning module, read and execute the code and instructions for each section in order using Shift + Enter.

## 1.    Parse a text file and store each element in a convenient data structure
The first step will be to open and read the file. Then, you will test how different data structures can help you store the data.

### 1.1.    Why can't the computer just open and read the file?
You can open any text file with the built-in `open()` function. 

Opening a file allows you to:
- "read it" (view its contents, with the `'r'` argument passed to the `open()` function), or
- "write it" (create or overwrite it, with `'w'` passed as an argument)

but either way no data is actually imported to your environment as a variable. For that purpose, you will need to `read` the lines of the file you just opened, and store them in a suitable data structure.

Below is one way to get a string containing the data from a file stored under the `files` directory:


In [1]:
path = 'files/P3_argonaut.gb'
with open(path, 'r') as f:
    data = f.read()

print(data[0:400])  # Print the first 400 characters of the new data string (a sequence of characters)
print(type(data))

LOCUS       NM_179453               3507 bp    mRNA    linear   PLN 21-AUG-2009
DEFINITION  Arabidopsis thaliana AGO1 (ARGONAUTE 1); endoribonuclease/ miRNA
            binding / protein binding / siRNA binding (AGO1) mRNA, complete
            cds.
ACCESSION   NM_179453
VERSION     NM_179453.2  GI:145361472
KEYWORDS    .
SOURCE      Arabidopsis thaliana (thale cress)
  ORGANISM  Arabidopsis thali
<class 'str'>


The approach used above (`f.read()`) returns a string that is legible to you, but not yet useful for the analyses. Discuss why with your partner(s).
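
For comparison with `f.read()`, here is a minimal sketch of two other ways to read an open file: `f.readlines()` and iterating over the file object line by line. The file content and the names `toy_text` and `toy_path` are made up for illustration; the sketch writes its own temporary file so that it does not depend on the assignment data.

```python
import os
import tempfile

# Hypothetical two-line file, written to a temporary location for this demo.
toy_text = "LOCUS       TOY_0001\nACCESSION   TOY_0001\n"
with tempfile.NamedTemporaryFile("w", suffix=".gb", delete=False) as tmp:
    tmp.write(toy_text)
    toy_path = tmp.name

with open(toy_path, "r") as f:
    as_string = f.read()  # one string holding the whole file

with open(toy_path, "r") as f:
    as_lines = f.readlines()  # list of strings, one per line, newline kept

with open(toy_path, "r") as f:
    stripped = [line.rstrip("\n") for line in f]  # iterate line by line

os.remove(toy_path)  # clean up the temporary file

print(type(as_string), len(as_string))
print(as_lines)
print(stripped)
```

None of the three results knows what a field or an entry is yet; they only differ in how the raw text is chunked.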

### 1.2.    Defining the problem
The strategy used to parse any text file should aim to convert its *data format* (how information is displayed to you in the file) into suitable *data structures* (the logical units that the computer can use).

The format of the file you will be working with is *.gb* (GenBank Flat File). Open the [Sample GenBank Record](https://www.ncbi.nlm.nih.gov/genbank/samplerecord/) for a detailed description of each field in a *.gb* file. 

Discuss with your partner:
- Which fields are relevant to the analyses? 
- How are fields structured? 
- Which characters delimit these fields in the text?

![gb](files/gb.png)

Out of all entries in a *.gb* file, only these are relevant for the assignment:
1.  Accession number
2.  Organism name
3.  Sequence length
4.  Sequence (used also for %GC calculations)
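
To illustrate how the sequence field feeds the length and %GC calculations, here is a minimal sketch. The helper names `clean_sequence` and `gc_content` and the toy sequence are hypothetical. The sketch assumes that the raw sequence text carries position numbers and spaces, as in the ORIGIN block of a GenBank record, and that these must be stripped before counting bases.

```python
def clean_sequence(raw):
    """Keep only the letters of a raw sequence block (drop digits and whitespace)."""
    return "".join(ch for ch in raw if ch.isalpha()).upper()


def gc_content(sequence):
    """Return the percentage of G and C bases in a cleaned nucleotide sequence."""
    if not sequence:
        return 0.0  # avoid dividing by zero on an empty sequence
    gc = sequence.count("G") + sequence.count("C")
    return 100 * gc / len(sequence)


# Made-up line in the style of a GenBank ORIGIN block (not real data).
toy_raw = "        1 atgcgcta gg\n"
toy_sequence = clean_sequence(toy_raw)

print(toy_sequence)
print(len(toy_sequence))
print(gc_content(toy_sequence))
```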

Notice that the sample *.gb* file contains a single GenBank entry, while your assignment file contains several records, each with its own accession number.


### 1.3.    Selecting a data structure type
Discuss with your partner(s) which of the data structures below seem to better capture the structure of the data in a *.gb* file for the assignment at hand.
- A)    **A list or set** containing a *string* for each entry. Each item in the list is an array of all characters in a GenBank entry. 
- B)    **A list of lists or tuples**, where each item in the list is an ordered list or tuple with all relevant data.
- C)    **A dictionary of lists**, where each key corresponds to a list of lists or tuples with all relevant data.
- D)    **A dictionary of dictionaries (nested dictionary)**, where each dictionary key is an entry and each value is a dictionary of the fields of that entry.


![structures](files/structures.png)


In a *.gb* file, each entry has a one-to-many relationship: one accession number is associated with many other relevant fields. For this reason, the best approaches are either **C** or **D**, where the accession number (the dictionary key) is paired with its fields (the dictionary values).
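
As a rough sketch of what options **C** and **D** could look like in Python, the example below uses the accession numbers, organisms and lengths visible in the file excerpts of this module. The variable names `records` and `records_c` and the field names are assumptions, and the sequences are left as empty placeholders because they would come from parsing the file.

```python
# Option D: dictionary of dictionaries, keyed by accession number.
records = {
    "NM_179453": {
        "organism": "Arabidopsis thaliana",
        "length": 3507,
        "sequence": "",  # placeholder; the real sequence is parsed from the file
    },
    "NM_001130718": {
        "organism": "Strongylocentrotus purpuratus",
        "length": 2868,
        "sequence": "",  # placeholder
    },
}

# Option C: same key, but the fields are stored by position in a tuple.
records_c = {
    "NM_179453": ("Arabidopsis thaliana", 3507, ""),
}

# With D, fields are looked up by name; with C, by position.
print(records["NM_179453"]["organism"])
print(records_c["NM_179453"][0])

for accession, fields in records.items():
    print(accession, fields["length"])
```

With option D you do not need to remember the order of the fields, at the cost of slightly more verbose code.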
 


## Fill in the blanks: creating the data structure to store all entries in the GenBank file.

A straightforward (but not infallible) method to obtain each entry separately is to read the open file as a string and split it at the point where each entry begins or ends.

Inspect the text file and assign a value to the `start` and `end` variables below. These are, respectively, the characters immediately preceding or succeeding each individual GB entry.

In [2]:
start = "LOCUS"
end = "//"

The next step is to extract all entries into a list. You will test 2 different approaches: `split()` at the `end` or at the `start` of each entry.

Fill in the blanks in the box below as indicated. Discuss with your partner(s) the differences in output between splitting the string with `start` and with `end`. How many entries are really there? Which method(s) give(s) the wrong output, and how can you fix it?

In [10]:
# Open the file:
with open(path, 'r') as f:
    data = f.read()

# split method
entries = data.split(start) # Fill in the blank with either start or end.


print("Length of the list:", len(entries), " entries")
print(entries[2]) # Fill in the blank with some index, from 0 to len(entries)-1.

# Try 0 first, then len(entries)-1 (remember Python starts counting from 0).


Length of the list: 8  entries
       NM_001130718            2868 bp    mRNA    linear   INV 03-AUG-2008
DEFINITION  Strongylocentrotus purpuratus argonaute 1 (ago1), mRNA.
ACCESSION   NM_001130718 XM_001176937 XM_001177941 XM_001178074 XM_793314
VERSION     NM_001130718.1  GI:195539326
KEYWORDS    .
SOURCE      Strongylocentrotus purpuratus
  ORGANISM  Strongylocentrotus purpuratus
            Eukaryota; Metazoa; Echinodermata; Eleutherozoa; Echinozoa;
            Echinoidea; Euechinoidea; Echinacea; Echinoida;
            Strongylocentrotidae; Strongylocentrotus.
COMMENT     PROVISIONAL REFSEQ: This record has not yet been subject to final
            NCBI review. The reference sequence was derived from EU733246.1.
            On or before Aug 3, 2008 this sequence version replaced
            gi:115686209, gi:115686211, gi:115972990, gi:115972988.
FEATURES             Location/Qualifiers
     source          1..2868
                     /organism="Strongylocentrotus purpuratus"
   

You have probably encountered some issues, like blank list entries. How can you remove an empty list entry? Fill in the blanks below.

In [11]:
# Remove empty entries using list comprehension:
entries = [entry for entry in entries if entry != ""] # Fill in the blanks at the end.
print("Updated length of the list:", len(entries), " entries \n") # See how the list has changed

print(entries[0])

Updated length of the list: 7  entries 

       NM_179453               3507 bp    mRNA    linear   PLN 21-AUG-2009
DEFINITION  Arabidopsis thaliana AGO1 (ARGONAUTE 1); endoribonuclease/ miRNA
            binding / protein binding / siRNA binding (AGO1) mRNA, complete
            cds.
ACCESSION   NM_179453
VERSION     NM_179453.2  GI:145361472
KEYWORDS    .
SOURCE      Arabidopsis thaliana (thale cress)
  ORGANISM  Arabidopsis thaliana
            Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta;
            Spermatophyta; Magnoliophyta; eudicotyledons; core eudicotyledons;
            rosids; malvids; Brassicales; Brassicaceae; Camelineae;
            Arabidopsis.
COMMENT     REVIEWED REFSEQ: This record has been curated by TAIR. The
            reference sequence was derived from AT1G48410.2.
            On Apr 18, 2007 this sequence version replaced gi:30694319.
FEATURES             Location/Qualifiers
     source          1..3507
                     /organism="Ara
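
The extra items come from how `str.split()` treats a delimiter at the very start or end of a string. The toy string below (made up, not taken from the assignment file) is a minimal sketch of that behaviour, and of a filter based on `strip()` that also discards items containing only whitespace.

```python
# Made-up text with two tiny "entries", each starting with LOCUS and ending with //.
toy = "LOCUS A\n//\nLOCUS B\n//\n"

by_start = toy.split("LOCUS")
by_end = toy.split("//")
print(by_start)  # ['', ' A\n//\n', ' B\n//\n']
print(by_end)    # ['LOCUS A\n', '\nLOCUS B\n', '\n']

# Comparing with "" misses the last item of by_end, which is "\n".
kept_naive = [entry for entry in by_end if entry != ""]

# Stripping first removes items that are empty or whitespace-only.
kept_strip = [entry.strip() for entry in by_end if entry.strip()]

print(len(kept_naive))
print(kept_strip)
```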

When you write your parser script, aim to make it as resistant as possible to errors like the empty list entry above. For that purpose, think about solutions that use different matches, regular expressions, error handling, etc.
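
As one possible sketch of such a solution, the example below splits only where `//` stands alone on a line, using `re.split()` with the `re.MULTILINE` flag, and then checks that each piece starts with `LOCUS`. The toy text, including the URL in its COMMENT line, is invented to show a case where a plain `split("//")` would cut an entry in the wrong place; it assumes that the terminator always sits on its own line.

```python
import re

# Made-up text: the first entry contains "//" inside a URL.
toy = (
    "LOCUS       TOY_0001\n"
    "ACCESSION   TOY_0001\n"
    "COMMENT     see http://example.org//path\n"
    "//\n"
    "LOCUS       TOY_0002\n"
    "ACCESSION   TOY_0002\n"
    "//\n"
)

# Plain split: every "//" counts, including those inside the URL.
naive = [piece.strip() for piece in toy.split("//") if piece.strip()]

# Regex split: only a line that consists of "//" (plus optional blanks) counts.
chunks = re.split(r"^//[ \t]*$", toy, flags=re.MULTILINE)
entries = [chunk.strip() for chunk in chunks if chunk.strip()]

# Basic error handling: stop early if a piece does not look like an entry.
for entry in entries:
    if not entry.startswith("LOCUS"):
        raise ValueError("Unexpected entry start: " + entry[:20])

print(len(naive))
print(len(entries))
```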

Databases contain errors!

Can you write below some code that:
- Inspects every item in the `entries` list
- Retrieves the accession number (ACCESSION) for each entry
- Creates a dictionary that maps each accession number to its entry.
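
If you get stuck, the following is one possible sketch on a made-up list, not the only solution. The names `toy_entries` and `by_accession` are hypothetical. It assumes that the first token on the ACCESSION line is the primary accession number, since some records list several, and it skips entries that have no ACCESSION line.

```python
import re

# Made-up entries; the last one has no ACCESSION line on purpose.
toy_entries = [
    "LOCUS       TOY_0001\nDEFINITION  toy record.\nACCESSION   TOY_0001\n",
    "LOCUS       TOY_0002\nACCESSION   TOY_0002 TOY_9999\n",
    "LOCUS       TOY_0003\nDEFINITION  broken record.\n",
]

by_accession = {}
for entry in toy_entries:
    # Look for a line that starts with ACCESSION and capture its first token.
    match = re.search(r"^ACCESSION[ \t]+(\S+)", entry, flags=re.MULTILINE)
    if match is None:
        print("Skipping an entry without an ACCESSION line")
        continue
    by_accession[match.group(1)] = entry

print(list(by_accession))
```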

# Prepare for the assignment

Now define: 
- A function that parses out each entry
- 

Remember to use multi-line docstrings to document the usage and parameters of each function.
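
As an illustration of a multi-line docstring, here is a minimal sketch of what such a function could look like. The name `parse_entry`, the returned keys and the regular expressions are assumptions rather than a required design; the toy entry reuses three lines from the first record shown earlier in this module.

```python
import re


def parse_entry(entry):
    """Extract a few fields from the text of one GenBank entry.

    Parameters
    ----------
    entry : str
        Text of a single GenBank record.

    Returns
    -------
    dict
        Keys 'accession', 'organism' and 'length'. A value is None when
        the corresponding field is not found in the entry.
    """
    accession = re.search(r"^ACCESSION[ \t]+(\S+)", entry, flags=re.MULTILINE)
    organism = re.search(r"^[ \t]*ORGANISM[ \t]+(.+)$", entry, flags=re.MULTILINE)
    length = re.search(r"(\d+) bp", entry)
    return {
        "accession": accession.group(1) if accession else None,
        "organism": organism.group(1).strip() if organism else None,
        "length": int(length.group(1)) if length else None,
    }


toy_entry = (
    "LOCUS       NM_179453               3507 bp    mRNA    linear   PLN 21-AUG-2009\n"
    "ACCESSION   NM_179453\n"
    "  ORGANISM  Arabidopsis thaliana\n"
)
parsed = parse_entry(toy_entry)
print(parsed)
print(parse_entry.__doc__.splitlines()[0])  # first line of the docstring
```

Calling `help(parse_entry)` in a notebook cell displays the full docstring.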