# Misctools tutorial

In my experience, bioinformaticians typically do highly situation-specific things to their data. So instead of providing a toolbox of tonnes of tools that will rarely apply anyway, I've so far simply made an easy-to-use API so you can patch together your own data analysis, UNIX-style.

Obviously, the first step is to import the module.

In [1]:
import misctools as binf

## Fasta sequences

This simple example shows how to open and read a Fasta file, apply a filter and print to an output file.

There are four lines:

    with open('test.fasta') as fastafile, open('filtered.fasta', 'w') as outfile:
        entries = binf.iterfasta(fastafile)
        filtered = (entry for entry in entries if len(entry) >= 150) # generator expression!
        binf.streamprint(filtered, outfile, buffersize=10000)

1) The first line opens an input fasta file and an output fasta file, keeping them open until the block is done.

2) The second line creates an iterator of Fasta entries in the input file.

3) The third line creates a generator of filtered entries from the iterator in 2).

4) The fourth line prints the content of the generator to the output file using a buffer.

Note the following properties:
* Using the `with open` statement, the files are closed automatically when the block ends, even if something goes wrong. The operating system will therefore not keep them open indefinitely, which could otherwise leak file handles or leave output unwritten
* Before the fourth line, no data has been read. The iterator and generator are simply instructions about *how to process the data* once it's read
* The streamprint function collects output in a buffer of limited size before writing it. This strikes a balance between limiting disk access, which is slow, and keeping the output in memory, which may cause a big memory footprint
* This code will be fast and memory efficient for files of all sizes: you can read a 100 GB file like this, using only a few MB of memory.
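
To make the laziness and the buffering concrete, here is a minimal sketch in plain Python. It does not show how misctools is implemented: `iter_fasta_sketch` and `stream_print_sketch` are hypothetical stand-ins written only for illustration, entries are plain `(header, sequence)` tuples, and the buffer is assumed to count entries.

```python
import io

def iter_fasta_sketch(lines):
    """Yield (header, sequence) tuples one entry at a time (hypothetical helper)."""
    header, chunks = None, []
    for line in lines:
        line = line.rstrip()
        if line.startswith('>'):
            if header is not None:
                yield header, ''.join(chunks)
            header, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if header is not None:
        yield header, ''.join(chunks)

def stream_print_sketch(entries, outfile, buffersize=10000):
    """Write entries in batches of at most `buffersize` entries (hypothetical helper)."""
    buffer = []
    for header, sequence in entries:
        buffer.append('>{}\n{}\n'.format(header, sequence))
        if len(buffer) >= buffersize:
            outfile.write(''.join(buffer))
            buffer.clear()
    outfile.write(''.join(buffer))  # write whatever is left in the buffer

# In-memory files stand in for real ones
infile = io.StringIO('>a\nACGT\n>b\nACGTACGT\nAC\n')
outfile = io.StringIO()

entries = iter_fasta_sketch(infile)
filtered = (entry for entry in entries if len(entry[1]) >= 5)
position_before = infile.tell()  # still 0: nothing has been read yet

# Only now is the input consumed, one entry at a time
stream_print_sketch(filtered, outfile, buffersize=2)
result = outfile.getvalue()
```

Because only one entry plus the buffer is held in memory at any time, the memory footprint does not grow with the size of the input.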

### A more elaborate example

Look through a fasta file for sequences from bacteria. All the ones with a name ending in an odd digit must be reverse complemented. Then translate to amino acids and truncate each of the sequences just before its first stop codon. Then write to a new file.

No problem!

In [6]:
# Create a generator of lines which does processing on the fly
# Barely any memory footprint!
def process():
    with open('test.fasta') as fastafile:
        entries = binf.iterfasta(fastafile)
        
        for entry in entries:
            
            # Skip non bacteria
            if 'Bacteria;' not in entry.header:
                continue
            
            # Reverse-complement the entries whose name ends in an odd digit
            name = entry.header.split()[0]
            if int(name[-1]) % 2 == 1:
                entry.reversecomplement()
                
            entry = entry.translate(endatstop=True)
            
            # A sequence with only a stop codon will return None
            if entry is not None:
                yield entry
                
# Stream it to the output file
with open('outfile', 'w') as outfile:
    binf.streamprint(process(), outfile)
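
For readers unsure what the two sequence operations do, here is a rough standalone sketch on plain strings. It is not the misctools implementation: `reverse_complement` and `translate_until_stop` are hypothetical names, the standard genetic code is assumed, and only unambiguous A/C/G/T bases are handled (any other codon raises a `KeyError`).

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order
BASES = 'TCAG'
AMINO_ACIDS = 'FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG'
CODON_TABLE = {''.join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

COMPLEMENT = str.maketrans('ACGTacgt', 'TGCAtgca')

def reverse_complement(sequence):
    """Complement every base, then reverse the string."""
    return sequence.translate(COMPLEMENT)[::-1]

def translate_until_stop(sequence):
    """Translate codon by codon, stopping before the first stop codon.

    Returns None if no amino acid precedes the stop. This mirrors the
    None check in the example above, but is an assumption of this sketch.
    """
    protein = []
    for i in range(0, len(sequence) - 2, 3):
        amino_acid = CODON_TABLE[sequence[i:i + 3].upper()]
        if amino_acid == '*':
            break
        protein.append(amino_acid)
    return ''.join(protein) or None

print(reverse_complement('ATGC'))            # GCAT
print(translate_until_stop('ATGGCCTAAGGG'))  # MA
print(translate_until_stop('TAA'))           # None
```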

## Fastq files

Fastq entries follow basically the same syntax.

Note that fastq files often store each entry on exactly four lines, one line per attribute, and that this can be exploited for speed. To do this, set `singleline` to True.
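
The reason the four-line layout is faster can be sketched as follows: the reader can grab four lines at a time without inspecting each line to find out where an entry ends. This is an illustration only, with `iter_fastq_four_lines` being a hypothetical helper rather than the misctools code.

```python
import io
from itertools import islice

def iter_fastq_four_lines(lines):
    """Yield (name, sequence, quality) assuming exactly four lines per entry."""
    iterator = iter(lines)
    while True:
        record = list(islice(iterator, 4))
        if not record:
            return  # end of file
        if len(record) < 4:
            raise ValueError('Truncated fastq entry')
        header, sequence, plus, quality = (line.rstrip('\n') for line in record)
        if not header.startswith('@') or not plus.startswith('+'):
            raise ValueError('Entry does not follow the four-line layout')
        yield header[1:], sequence, quality

# In-memory file standing in for a real fastq file
fastqfile = io.StringIO('@r1\nACGT\n+\nIIII\n@r2\nGG\n+\n!!\n')
records = list(iter_fastq_four_lines(fastqfile))
```

Files where sequences wrap over several lines would fail the layout check here, which is why this shortcut has to be opted into.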

Let's say we only want the entries where each of the last 10 bases has at most a 1 in 100 probability of being miscalled.

In [16]:
from math import log
maxlogprob = log(1/100)

with open('test.fastq') as fastqfile:
    entries = binf.iterfastq(fastqfile, singleline=True)
    
    # Check whether the reads make sense with a PHRED 33 score
    goodentries = (entry for entry in entries if entry.check(phred=33))
    
    # Filter for the ones with good base calling probabilities
    probable = (entry for entry in goodentries if max(entry.logprobs[-10:]) <= maxlogprob)
    
    # How many are there?
    print(sum(1 for entry in probable))

18
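
For reference, a 1 in 100 error probability corresponds to a PHRED score of 20, since a score Q encodes the probability 10^(-Q/10). The sketch below applies the same kind of filter directly to a quality string, assuming the PHRED+33 encoding. The function names are hypothetical and independent of misctools.

```python
def phred_scores(quality, offset=33):
    """Convert a quality string to integer PHRED scores (PHRED+33 assumed)."""
    return [ord(character) - offset for character in quality]

def error_probability(score):
    """Probability that a base with this PHRED score is miscalled."""
    return 10 ** (-score / 10)

def tail_is_reliable(quality, n=10, min_phred=20, offset=33):
    """True if each of the last n bases has a PHRED score of at least min_phred."""
    return all(score >= min_phred
               for score in phred_scores(quality[-n:], offset))

print(phred_scores('I5+'))                # [40, 20, 10]
print(tail_is_reliable('+' + '5' * 10))   # True: the low-quality base is not in the tail
print(tail_is_reliable('5' * 9 + '+'))    # False: the last base has a score of 10
```

Comparing integer scores rather than probabilities avoids floating point rounding at the threshold.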


## SAM files

Python is much slower than compiled programs like samtools, so only iterate over SAM files if what you want cannot be achieved with samtools.

For example, let's get all lines which are primary alignments where the partner does not

In [None]:
# Parse a SAM file and count how many headers there are, and how many lines with certain flags
with open('test.sam') as samfile:
    parser = binf.SamParser(samfile)
    
    print(sum(1 for header in parser.iterheaders()))
    print(sum(1 for entry in parser if entry.hasflag('unmapped') and not entry.hasflag('partner unmapped')))
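
Flag checks like the ones above come down to testing bits in the second column of each alignment line. Here is a small sketch of that idea without misctools. The bit values follow the SAM specification, but `has_flag`, `parse_flag` and the flag names other than 'unmapped' and 'partner unmapped' are hypothetical choices made for this example.

```python
# Bit values as defined for the FLAG field in the SAM specification
FLAGS = {
    'paired': 0x1,
    'proper pair': 0x2,
    'unmapped': 0x4,
    'partner unmapped': 0x8,
    'reverse': 0x10,
    'partner reverse': 0x20,
    'first in pair': 0x40,
    'last in pair': 0x80,
    'secondary': 0x100,
    'failed qc': 0x200,
    'duplicate': 0x400,
    'supplementary': 0x800,
}

def parse_flag(line):
    """Extract the integer FLAG from a tab-separated alignment line."""
    return int(line.split('\t')[1])

def has_flag(flag, name):
    """True if the named bit is set in the integer flag."""
    return bool(flag & FLAGS[name])

lines = [
    '@HD\tVN:1.6',                                   # header lines start with '@'
    'read1\t69\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII',    # 69 = 1 + 4 + 64
    'read2\t73\tchr1\t5\t30\t4M\t=\t5\t0\tACGT\tIIII',  # 73 = 1 + 8 + 64
]

headers = [line for line in lines if line.startswith('@')]
alignments = [line for line in lines if not line.startswith('@')]

# Reads that are unmapped while their partner is mapped
selected = [line.split('\t')[0] for line in alignments
            if has_flag(parse_flag(line), 'unmapped')
            and not has_flag(parse_flag(line), 'partner unmapped')]
```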