##### 30 Oct 2019

# I/O II:  Parsing Files

#### Reading:  _PCfB_ Ch 10 

#### Today's Topics: 

* context manager (`with` statement)
* parsers
* FASTA files
* ad hoc parsing

Note: the techniques described in the last section ("ad hoc parsing") will be used on one of the problems in Project 5.

## Idiom for Working with Files: &nbsp; Use `with` 

Python has a statement named `with` that is useful for working with files

It automatically closes the file for us when the block ends, even if an error occurs inside the block

Replace
```
f = open("filename")
# code that uses f
f.close()
```
by
```
with open("filename") as f:
   # code that uses f
```

#### Example 

Print and count the lines in a file:

In [None]:
linenum = 0

with open('quote.txt') as f:
    for line in f: 
        linenum += 1
        print('line {:d}:  {}'.format(linenum, line.rstrip()))
        
print('found', linenum, 'lines')

#### Terminology 

The `with` statement works with a "context manager" (here, the file object returned by `open`)
* the body of the statement defines a context in which the variable (`f` in this case) refers to a file
* when we're done with the context Python cleans up everything and discards the context

(There are other types of contexts in Python but we won't use them in this course)
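
A minimal sketch of what "cleaning up" means in practice: the `closed` attribute of a file object shows that the file is closed as soon as the `with` block ends. The file name `demo.txt` is hypothetical and is created here only so the example is self-contained.

```python
import os
import tempfile

# Create a small throwaway file (hypothetical name) so the example runs anywhere
path = os.path.join(tempfile.mkdtemp(), 'demo.txt')
with open(path, 'w') as f:
    f.write('first line\nsecond line\n')

with open(path) as f:
    inside = f.closed          # False: the file is open inside the block
    lines = f.readlines()

outside = f.closed             # True: Python closed the file when the block ended
print(inside, outside, len(lines))
```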

## Parsers

The word "parse" comes from linguistics

![ascii](http://pages.uoregon.edu/conery/Bi410/parse.png)

In computing a **parser** is a function that analyzes the structure of a piece of text
* computer scientists originally applied the term to programming languages:  a compiler parses a program to make sure it is syntactically correct
* now used in data science to refer to any function that processes input data

### Examples 

Suppose we want to get a list of payees (people or businesses we wrote checks to) from the file named `checking_account.csv`
* remove duplicates, _i.e._ we want each payee to appear once
* print the list in alphabetical order

As a reminder this is what the file looks like:

```
$ cat checking_account.csv

12/02/2018,REGISTER GUARD,11.96
12/03/2018,APL* ITUNES.COM/BILL,0.99
12/04/2018,CHEVRON 0204468,35.52
12/08/2018,MARKET OF CHOICE #9,11.21
12/09/2018,FOOD FOR LANE COUNTY,5.00
12/14/2018,CHEVRON 0204468,16.25
12/21/2018,MARKET OF CHOICE #9,38.76
12/28/2018,MARKET OF CHOICE #9,18.78
12/31/2018,KING ESTATE WINERY,44.40
```

#### Plan 

* initialize an output list
* read each line in the file
* parse each line (break it into separate fields)
* if the payee (second field) is not in the output list append it
* sort and return the list

#### Sandbox 

Make a string using one of the lines, then experiment with `strip` and `split` to make sure we know how to use them

In [None]:
s = open('checking_account.csv').readline()   # get the first line in the data file

In [None]:
s

In [None]:
s.strip()

In a previous project we saw it's possible to combine calls to `strip` and `split` into a single expression:

In [None]:
s.strip().split(',')

Also recall we can use "tuple assignment" to save the pieces of the line
* the list created by split should always have 3 items
* put three variable names on the left side of the assignment

In [None]:
s = '1/1/1970,Pies R Us,3.14'

In [None]:
a, b, c = s.strip().split(',')
print("payee:", b)
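
Tuple assignment raises a `ValueError` when the number of pieces does not match the number of names on the left side. A minimal sketch of guarding against a malformed line; the helper name `split_record` and the sample strings are made up for illustration:

```python
def split_record(line):
    """Return (date, payee, amount), or None if the line does not have 3 fields."""
    try:
        date, payee, amount = line.strip().split(',')
    except ValueError:
        # raised when split produces fewer or more than 3 pieces
        return None
    return date, payee, amount

print(split_record('1/1/1970,Pies R Us,3.14\n'))
print(split_record('1/1/1970,Pies R Us\n'))
```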

#### Code 

In [None]:
def payee_list(fn):
    res = []
    with open(fn) as f:
        for line in f:
            date, payee, amount = line.strip().split(',')
            payee = payee.title()      # convert first so the duplicate check compares like with like
            if payee not in res:
                res.append(payee)
    return sorted(res)

In [None]:
payee_list('checking_account.csv')
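
An alternative sketch, not the version used in this lecture: Python's standard `csv` module splits each record into fields (it also copes with quoted fields that contain commas), and a set removes duplicates automatically. The function name `payee_set` and the temporary sample file are assumptions made for this illustration.

```python
import csv
import os
import tempfile

def payee_set(fn):
    payees = set()                      # a set keeps only one copy of each payee
    with open(fn, newline='') as f:
        for row in csv.reader(f):
            if len(row) == 3:           # ignore blank or malformed records
                date, payee, amount = row
                payees.add(payee.title())
    return sorted(payees)

# Build a small sample file so the sketch is self-contained
sample = os.path.join(tempfile.mkdtemp(), 'sample.csv')
with open(sample, 'w') as f:
    f.write('12/04/2018,CHEVRON 0204468,35.52\n')
    f.write('12/08/2018,MARKET OF CHOICE #9,11.21\n')
    f.write('12/14/2018,CHEVRON 0204468,16.25\n')

print(payee_set(sample))
```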

### Example: Find Protein Patterns 

This function will return the deflines of sequences that contain a specified pattern
* first argument: a string ("motif") to look for
* second argument: the name of the file to search

#### Aside: FASTA Files 

The file format for this project is known as FASTA (pronounced "fast-uh" or "fast-ay")
* sequence descriptions are lines that begin with a greater-than symbol
  * description lines are called "deflines"
* sequence data is on lines in between deflines

For this project we can assume all sequences are on a single line
* i.e. the file consists of alternating deflines and sequence lines
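
In general FASTA files a sequence may be wrapped across several lines. A minimal sketch of a reader that handles that case by joining the lines between deflines; the function name `read_fasta` and the sample records are invented for illustration and are not part of the project code:

```python
import os
import tempfile

def read_fasta(fn):
    """Return a list of (defline, sequence) pairs from a FASTA file."""
    records = []
    defline = None
    parts = []                          # pieces of the current sequence
    with open(fn) as f:
        for line in f:
            line = line.strip()
            if line.startswith('>'):
                if defline is not None:
                    records.append((defline, ''.join(parts)))
                defline = line
                parts = []
            elif line:
                parts.append(line)
    if defline is not None:             # don't forget the last record
        records.append((defline, ''.join(parts)))
    return records

# Made-up sample data, with the first sequence wrapped across two lines
sample = os.path.join(tempfile.mkdtemp(), 'sample.fasta')
with open(sample, 'w') as f:
    f.write('>seq1 example\nMVLSP\nADKTN\n>seq2 example\nMSLSD\n')

print(read_fasta(sample))
```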

The file is named `hemoglobin.fasta` (find it in the `data` folder in the Docker container or download it from the server).

In [None]:
! head -4 hemoglobin.fasta

#### Method 

* create an empty list to hold the results
* iterate over the lines in the file
* if a line starts with `'>'` save it as the current defline and go on to the next line
* if the motif occurs in the sequence append the defline to the result

#### Code 

This example introduces a new statement:  `continue`
* used only inside loops
* it means "skip the rest of the statements in the body of the loop and go back to the loop header"
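
A small sketch of `continue` on its own, using a made-up list of lines instead of a file:

```python
lines = ['>seq1', 'MVLSPADKTN', '>seq2', 'MSLSDKDKAA']   # made-up data

sequences = []
for line in lines:
    if line.startswith('>'):
        continue                 # skip the append below and go on to the next line
    sequences.append(line)

print(sequences)
```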

In [None]:
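def find_pattern(motif, fn):
    lst = []
    defline = None
    with open(fn) as f:
        for line in f:
            if line.startswith('>'):
                defline = line.strip()    # remember the most recent defline
                continue
            if motif in line:
                lst.append(defline)       # return the defline, not the sequence
    return lst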

In [None]:
find_pattern('SKYR', 'hemoglobin.fasta')

In [None]:
find_pattern('MVL', 'hemoglobin.fasta')

## Bioinformatics Libraries 

If you plan on working with sequence files you should invest some time learning to use special purpose libraries
* Biopython
* scikit-bio

These libraries define new data types to represent sequences and take care of all the work of reading data from files

Example (assuming `fn` is the name of a FASTA file and `name` is a string to look for in deflines):
```
from FASTA import *

for seq in FASTAReader(fn):
    if name in seq.defline():
        gc_content(seq.sequence())
```

## Ad Hoc Parsers 

Previous examples have been based on widely-used file formats
* CSV (and TSV) records
* FASTA for sequence files (both DNA and amino acid)

Often we need to write a special-purpose parser for our own unique requirements

### Example: Extract Sequence IDs from FASTA 

Suppose we want to make a list of sequence identifiers that appear on deflines in a FASTA file
* we're looking for substrings that start with `NP_`
* the ID includes one or more digits after the underscore

To get an idea of what we're looking for, this shell command prints all the lines in the file that contain the string `"NP_"`:
<pre>
$ grep NP_ hemoglobin.fasta

>gi|4504347|ref|NP_000549.1| hemoglobin subunit alpha [Homo sapiens]
>gi|47271417|ref|NP_571332.2| hemoglobin subunit alpha [Danio rerio]
>gi|145301578|ref|NP_032244.2| hemoglobin subunit alpha [Mus musculus]
>gi|52138655|ref|NP_001004376.1| hemoglobin subunit alpha-A [Gallus gallus]
</pre>

#### Method

Note the defline is a series of fields separated by vertical bars.  One way to approach this problem is to use `split` to separate the defline into smaller pieces
* first `split` using a vertical bar
* the sequence ID is in the 4th part
* use `split` again to separate the ID field into parts before and after the period

#### Sandbox 

Let's read the first line of the data file (which we know is a defline) and use it in some experiments with `split`

In [None]:
s = open('hemoglobin.fasta').readline()

In [None]:
s.split('|')

In [None]:
s.split('|')[3]

In [None]:
idpart = s.split('|')[3]

In [None]:
idpart.split('.')

In [None]:
idpart.split('.')[0]

#### Code

In [None]:
def parse_seq_ids(fn):
    res = []
    with open(fn) as f:
        for line in f:
            if line.startswith('>'):
                idpart = line.split('|')[3]
                res.append(idpart.split('.')[0])
    return res

In [None]:
parse_seq_ids('hemoglobin.fasta')
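
Another way to express "`NP_` followed by one or more digits" is a regular expression. This is an alternative sketch, not the method used above; the function name `find_np_ids` is an assumption, and the sample deflines are copied from the `grep` output shown earlier.

```python
import re

NP_ID = re.compile(r'NP_\d+')       # 'NP_' followed by one or more digits

def find_np_ids(lines):
    """Return the NP_ identifiers found on deflines in an iterable of lines."""
    res = []
    for line in lines:
        if line.startswith('>'):
            m = NP_ID.search(line)
            if m:                    # m is None when the defline has no NP_ identifier
                res.append(m.group())
    return res

deflines = [
    '>gi|4504347|ref|NP_000549.1| hemoglobin subunit alpha [Homo sapiens]',
    '>gi|47271417|ref|NP_571332.2| hemoglobin subunit alpha [Danio rerio]',
]
print(find_np_ids(deflines))
```

Because a file object can be iterated line by line, the same function could also be called as `find_np_ids(f)` inside a `with open(fn) as f:` block.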