# Misctools tutorial

In my experience, bioinformaticians typically do highly situation-specific things to their data. So instead of providing entire pipelines that will never quite fit your case, I've made a toolbox of functions that you can combine to patch together your own data analysis, UNIX-style.

In [2]:
import os
import misctools
import numpytools

## Fasta sequences

Let's instantiate a fasta sequence manually to work with:

In [3]:
thioredoxin_cds = """ATGGTGAAGCAGATCGAGAGCAAGACTGCTTTTCAGGAAGCCTTGGACGCTGCAGGTG
ATAAACTTGTAGTAGTTGACTTCTCAGCCACGTGGTGTGGGCCTTGCAAAATGATCAAGCCTTTCTTTCATGATGTTGC
TTCAGAGTGTGAAGTCAAATGCATGCCAACATTCCAGTTTTTTAAGAAGGGACAAAAGGTGGGTGAATTTTCTGGAGCC
AATAAGGAAAAGCTTGAAGCCACCATTAATGAATTAGTCTAA""".replace('\n', '')

thioredoxin = misctools.FastaEntry("TXN", thioredoxin_cds)
thioredoxin

<Fasta Entry TXN>

---
This object is fast and memory efficient. You can ask for its length and slice its sequence directly:

---

In [4]:
print(len(thioredoxin))
print(thioredoxin[18:54]) # This is a string, not a Fasta Entry!

258
AGCAAGACTGCTTTTCAGGAAGCCTTGGACGCTGCA


---
The string representation (`str(thioredoxin)`) is fasta-formatted, with a single line for the sequence. If you want to represent it with the sequence over multiple lines, use the `thioredoxin.format` method:

---

In [8]:
print(str(thioredoxin))
print(thioredoxin.format(width=79))

>TXN
ATGGTGAAGCAGATCGAGAGCAAGACTGCTTTTCAGGAAGCCTTGGACGCTGCAGGTGATAAACTTGTAGTAGTTGACTTCTCAGCCACGTGGTGTGGGCCTTGCAAAATGATCAAGCCTTTCTTTCATGATGTTGCTTCAGAGTGTGAAGTCAAATGCATGCCAACATTCCAGTTTTTTAAGAAGGGACAAAAGGTGGGTGAATTTTCTGGAGCCAATAAGGAAAAGCTTGAAGCCACCATTAATGAATTAGTCTAA
>TXN
ATGGTGAAGCAGATCGAGAGCAAGACTGCTTTTCAGGAAGCCTTGGACGCTGCAGGTGATAAACTTGTAGTAGTTGACT
TCTCAGCCACGTGGTGTGGGCCTTGCAAAATGATCAAGCCTTTCTTTCATGATGTTGCTTCAGAGTGTGAAGTCAAATG
CATGCCAACATTCCAGTTTTTTAAGAAGGGACAAAAGGTGGGTGAATTTTCTGGAGCCAATAAGGAAAAGCTTGAAGCC
ACCATTAATGAATTAGTCTAA
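
If you are curious what wrapping a sequence over multiple lines involves, here is a minimal sketch in plain Python. The function name `format_fasta` and its default width are my own assumptions for illustration. This is not the code behind `FastaEntry.format`.

```python
def format_fasta(header, sequence, width=60):
    """Return a FASTA record with the sequence wrapped at `width` characters."""
    if width < 1:
        raise ValueError("width must be a positive integer")
    # Slice the sequence into consecutive chunks of at most `width` characters.
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join([">" + header] + lines)


record = format_fasta("TXN", "ATGGTGAAGCAGATCGAG", width=10)
print(record)
```

The last line of a record is simply whatever is left over, so it may be shorter than `width`.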


---
You can also create reverse complemented and translated versions of the entry:


---

In [6]:
rvc = thioredoxin.reversecomplemented()
trn = thioredoxin.translated(endatstop=True)

print(rvc)
print(trn)

>TXN
TTAGACTAATTCATTAATGGTGGCTTCAAGCTTTTCCTTATTGGCTCCAGAAAATTCACCCACCTTTTGTCCCTTCTTAAAAAACTGGAATGTTGGCATGCATTTGACTTCACACTCTGAAGCAACATCATGAAAGAAAGGCTTGATCATTTTGCAAGGCCCACACCACGTGGCTGAGAAGTCAACTACTACAAGTTTATCACCTGCAGCGTCCAAGGCTTCCTGAAAAGCAGTCTTGCTCTCGATCTGCTTCACCAT
>TXN
MVKQIESKTAFQEALDAAGDKLVVVDFSATWCGPCKMIKPFFHDVASECEVKCMPTFQFFKKGQKVGEFSGANKEKLEATINELV
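
To make it clearer what these two methods compute, here is a rough sketch of reverse complementation and translation in plain Python. The names `reverse_complement` and `translate` and the `end_at_stop` parameter are hypothetical stand-ins, the sketch assumes unambiguous DNA (only A, C, G and T) and the standard genetic code, and it is not how `FastaEntry` implements them.

```python
from itertools import product

# Complement table for unambiguous DNA bases (assumption: no IUPAC codes).
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def reverse_complement(sequence):
    """Complement every base, then reverse the string."""
    return sequence.translate(COMPLEMENT)[::-1]


def translate(sequence, end_at_stop=True):
    """Translate in frame 0; trailing bases that do not fill a codon are ignored."""
    protein = []
    for i in range(0, len(sequence) - 2, 3):
        amino_acid = CODON_TABLE[sequence[i:i + 3].upper()]
        if amino_acid == "*" and end_at_stop:
            break
        protein.append(amino_acid)
    return "".join(protein)


print(reverse_complement("ATGGTGAAG"))
print(translate("ATGGTGAAGCAGTAA"))
```

The example inputs are the first few codons of the thioredoxin sequence above, with a stop codon appended to the second one.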


---
Normally we don't instantiate fasta entries manually. Instead, we work with fasta files.

## Fasta files

There are a few functions for dealing with fasta files:

`iterfasta` - Read fasta files as FastaEntry objects. Use this as the default.

`simple_iterfasta` - Read fasta files as (header, sequence) tuples.

`byte_iterfasta` - this is part of the module `numpytools`... stay tuned.

All of these functions are used the same way. Each is called on an open file and returns a generator. While the file is open, the generator can be iterated over to yield entries:

In [7]:
with open('test/test.fasta') as file:
    entries = misctools.iterfasta(file)
    first_entry = next(entries)
    next_ten_entries = [next(entries) for i in range(10)]
    rest_of_entries = list(entries)
    
first_entry

<Fasta Entry Q5ZMT0 Eukaryota;Metazoa;Chordata;Craniata;Vertebrata;Euteleostomi;Archelosauria;Archosauria;Dinosauria;Saurischia;Theropoda;Coelurosauria;Aves;Neognathae;Galloanserae;Galliformes;Phasianidae;Phasianinae;Gallus>

---
There is also the similar `simple_iterfasta`, which yields tuples as shown below.

---

In [12]:
with open('test/test.fasta') as file:
    entries = misctools.simple_iterfasta(file)
    first_header, first_sequence = next(entries)
    
first_header, first_sequence

('Q5ZMT0 Eukaryota;Metazoa;Chordata;Craniata;Vertebrata;Euteleostomi;Archelosauria;Archosauria;Dinosauria;Saurischia;Theropoda;Coelurosauria;Aves;Neognathae;Galloanserae;Galliformes;Phasianidae;Phasianinae;Gallus',
 'ATCGGTCCATCGTGATTATCGGCCGAGAGACAACCTGAAAACCGACCTCGTATAGCACTTGCCGGTAGATCACAATCCAGGGCCACTAACTAATGACTCTAAAACTTAGACAAGTGGTATCACATACATAATGGGTATTCTGGTTATAAATGGAGCTCCCGTGCATGCATGCGTCGTGTACAATTATAAGTGCGTCCCCCAAGAACCGGGGAGGACTCCTTGACATTATCAGTCCTCATTCTAGGCGATGATCCTGTTGGCCGC')

---
These readers are fast, and only load the sequences into memory as they are being iterated over. You can safely open that 200 GB fasta file! As always in Python, make sure to use the `with open` syntax to automatically close files.

They only do the most rudimentary format checking, though, so make sure the file is actually fasta formatted.
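
To illustrate why such a reader needs so little memory, here is a minimal sketch of a lazy fasta parser. The name `iter_fasta_records` is made up for this example, and the real `iterfasta` may well work differently. The point is only that a generator holds a single record at a time.

```python
import io


def iter_fasta_records(lines):
    """Yield (header, sequence) tuples one at a time from an iterable of lines."""
    header = None
    chunks = []
    for line in lines:
        line = line.rstrip()
        if line.startswith(">"):
            # A new header means the previous record is complete.
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is None:
            # Rudimentary format check: no sequence data before the first header.
            if line:
                raise ValueError("Sequence data found before the first header")
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


handle = io.StringIO(">seq1 first\nACGT\nAC\n>seq2\nGG\n")
records = iter_fasta_records(handle)
first_record = next(records)   # only the first record has been parsed so far
second_record = next(records)
print(first_record, second_record)
```

Because nothing is parsed until `next` is called, the size of the file does not matter, only the size of the largest single record.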

Let's have a look at how to print to files memory-efficiently:

## Stream printers

A stream printer allows you to print an iterator to a file a number of objects at a time. This strikes a balance between limiting disk access, which is slow, and keeping the output in memory, which may cause a big memory footprint. There are two stream printer functions. The first one, `streamprint`, is used with open files.

In the code above, we got a few fasta entries. Let's print the rest of the entries to a file, 100 entries at a time:

---

In [9]:
os.mkdir('outputs')

with open('outputs/last_entries.fasta', 'w') as file:
    misctools.streamprint(rest_of_entries, file, bufferlength=100)
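
The buffering idea behind a stream printer can be sketched in a few lines. This is an illustration under my own assumptions (the name `buffered_print`, and that each object is written as `str(object)` on its own line), not the actual source of `streamprint`.

```python
import io


def buffered_print(items, handle, bufferlength=100):
    """Write items to handle, flushing the buffer every `bufferlength` items."""
    buffer = []
    for item in items:
        buffer.append(str(item))
        if len(buffer) == bufferlength:
            print(*buffer, sep="\n", file=handle)
            buffer.clear()
    # Write whatever is left when the iterator is exhausted.
    if buffer:
        print(*buffer, sep="\n", file=handle)


output = io.StringIO()
buffered_print(range(5), output, bufferlength=2)
print(output.getvalue())
```

At most `bufferlength` objects are held at any time, no matter how long the input iterator is.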

---
We can combine a fasta reader with a stream printer to read a file, filter it, and write the filtered entries back using very little memory.

Make sure the filtering is done with a generator expression, though. A list would keep all the entries in memory!

---

In [10]:
with open('test/test.fasta') as fastafile, open('outputs/filtered.fasta', 'w') as outfile:
    entries = misctools.iterfasta(fastafile)
    filtered = (entry.translated() for entry in entries if len(entry) >= 150) # generator expression!
    misctools.streamprint(filtered, outfile)

---
The other stream printer function is the fork printer, `forkprint`. This allows you to print to several files memory-efficiently. Please be aware, though, that it keeps a buffer _per file_, so if you make the buffers hold hundreds of megabytes each, problems can arise.

Say you want to iterate over a fasta file and split it into two files, one with sequences from _Bacteria_ and one with sequences from eukaryotes and archaea:
---

In [11]:
with open('test/test.fasta') as fastafile:
    entries = misctools.iterfasta(fastafile)
    
    # Yield (index, object), to print to the index'th filename given.
    generator = ((0, entry) if 'Bacteria' in entry.header else (1, entry) for entry in entries)
    misctools.forkprint(generator, 'outputs/bacteria.fasta', 'outputs/neomura.fasta')
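
Here is a rough sketch of how a fork printer with one buffer per file could look. The name `fork_print` and the `bufferlength` keyword are assumptions for the sake of the example, and the real `forkprint` may differ in its details.

```python
import os
import tempfile
from contextlib import ExitStack


def fork_print(pairs, *filenames, bufferlength=100):
    """Write each item of (index, item) pairs to the index'th filename."""
    buffers = [[] for _ in filenames]  # one buffer per output file
    with ExitStack() as stack:
        handles = [stack.enter_context(open(name, "w")) for name in filenames]
        for index, item in pairs:
            buffer = buffers[index]
            buffer.append(str(item))
            if len(buffer) == bufferlength:
                print(*buffer, sep="\n", file=handles[index])
                buffer.clear()
        # Flush the remaining items of every buffer.
        for buffer, handle in zip(buffers, handles):
            if buffer:
                print(*buffer, sep="\n", file=handle)


with tempfile.TemporaryDirectory() as directory:
    even_path = os.path.join(directory, "even.txt")
    odd_path = os.path.join(directory, "odd.txt")
    fork_print(((n % 2, n) for n in range(5)), even_path, odd_path, bufferlength=2)
    with open(even_path) as handle:
        even_text = handle.read()
    with open(odd_path) as handle:
        odd_text = handle.read()

print(even_text, odd_text)
```

Since every file has its own buffer, the worst-case memory use grows with the number of output files times the buffer size.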

## Tying together iterfasta, streamprint and FastaEntry:

Look through a fasta file for sequences from bacteria. All the ones with an accession ending in an odd number must be reverse complemented. Then translate to amino acids and truncate each of the sequences to before their first stop. Then write to a new file.

No problem!

---

In [15]:
# Create a generator of entries which does processing on the fly.
# Barely any memory footprint!
def process(entries):
    for entry in entries:
        # Skip non-bacteria
        if 'Bacteria;' not in entry.header:
            continue

        # Reverse-complement the ones whose accession ends in an odd digit
        accession = entry.header.split()[0]
        if int(accession[-1]) % 2 == 1:
            entry = entry.reversecomplemented()

        entry = entry.translated(endatstop=True)

        # A sequence with only a stop codon will return None
        if entry is not None:
            yield entry
                
# Stream it to the output file
with open('test/test.fasta') as fastafile, open('outputs/fixed.fasta', 'w') as outfile:
    generator = process(misctools.iterfasta(fastafile))
    misctools.streamprint(generator, outfile)

---
## Opening files which may be gzipped

The `Reader` class can be used instead of the `open` function if you don't know whether the file is gzipped.
On opening, the class reads the first few bytes to check the signature. If it is that of a gzipped file, it returns a map object that yields unzipped, decoded lines. If not, it works like an ordinary `open`:

---

In [16]:
with misctools.Reader('test/test.fasta') as lines:
    print(next(lines))
    print(next(lines))

>Q5ZMT0 Eukaryota;Metazoa;Chordata;Craniata;Vertebrata;Euteleostomi;Archelosauria;Archosauria;Dinosauria;Saurischia;Theropoda;Coelurosauria;Aves;Neognathae;Galloanserae;Galliformes;Phasianidae;Phasianinae;Gallus

ATCGGTCCATCGTGATTATCGGCCGAGAGACAACCTGAAAACCGACCTCGTATAGCACTTGCCGGTAGATCACAATCCAGGGCCACTAACTAATGACTCTAAAACTTAGACAAGTGGTATCACATACATAATGGGTATTCTGGTTATAAATGGAGCTCCCGTGCATGCATGCGTCGTGTACAATTATAAGTGCGTCCCCCAAGAACCGGGGAGGACTCCTTGACATTATCAGTCCTCATTCTAGGCGATGATCCTGTTGGCCGC



In [17]:
with misctools.Reader('test/gzipped.gz') as lines:
    print(next(lines))
    print(next(lines))

>Q5ZMT0 Eukaryota;Metazoa;Chordata;Craniata;Vertebrata;Euteleostomi;Archelosauria;Archosauria;Dinosauria;Saurischia;Theropoda;Coelurosauria;Aves;Neognathae;Galloanserae;Galliformes;Phasianidae;Phasianinae;Gallus

ATCGGTCCATCGTGATTATCGGCCGAGAGACAACCTGAAAACCGACCTCGTATAGCACTTGCCGGTAGATCACAATCCAGGGCCACTAACTAATGACTCTAAAACTTAGACAAGTGGTATCACATACATAATGGGTATTCTGGTTATAAATGGAGCTCCCGTGCATGCATGCGTCGTGTACAATTATAAGTGCGTCCCCCAAGAACCGGGGAGGACTCCTTGACATTATCAGTCCTCATTCTAGGCGATGATCCTGTTGGCCGC



In [18]:
with misctools.Reader('test/gzipped.gz', 'rb') as lines:
    print(next(lines))
    print(next(lines))

b'>Q5ZMT0 Eukaryota;Metazoa;Chordata;Craniata;Vertebrata;Euteleostomi;Archelosauria;Archosauria;Dinosauria;Saurischia;Theropoda;Coelurosauria;Aves;Neognathae;Galloanserae;Galliformes;Phasianidae;Phasianinae;Gallus\n'
b'ATCGGTCCATCGTGATTATCGGCCGAGAGACAACCTGAAAACCGACCTCGTATAGCACTTGCCGGTAGATCACAATCCAGGGCCACTAACTAATGACTCTAAAACTTAGACAAGTGGTATCACATACATAATGGGTATTCTGGTTATAAATGGAGCTCCCGTGCATGCATGCGTCGTGTACAATTATAAGTGCGTCCCCCAAGAACCGGGGAGGACTCCTTGACATTATCAGTCCTCATTCTAGGCGATGATCCTGTTGGCCGC\n'
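
The signature check itself is simple: gzip files start with the two bytes `1f 8b`. Below is a small sketch of the idea using only the standard library. The function `open_maybe_gzipped` is hypothetical and returns ordinary file objects, unlike the map object described above.

```python
import gzip
import os
import tempfile

GZIP_MAGIC = b"\x1f\x8b"  # the first two bytes of every gzip file


def open_maybe_gzipped(path, mode="rt"):
    """Open path with gzip.open if it has the gzip signature, else with open."""
    with open(path, "rb") as handle:
        signature = handle.read(2)
    if signature == GZIP_MAGIC:
        return gzip.open(path, mode)
    return open(path, mode)


with tempfile.TemporaryDirectory() as directory:
    plain_path = os.path.join(directory, "plain.fasta")
    gzipped_path = os.path.join(directory, "zipped.fasta.gz")
    with open(plain_path, "w") as handle:
        handle.write(">seq1\nACGT\n")
    with gzip.open(gzipped_path, "wt") as handle:
        handle.write(">seq1\nACGT\n")

    with open_maybe_gzipped(plain_path) as lines:
        plain_first = next(lines)
    with open_maybe_gzipped(gzipped_path) as lines:
        gzipped_first = next(lines)

print(plain_first == gzipped_first)
```

Checking the content rather than the file extension means a gzipped file without a `.gz` suffix is still handled correctly.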


## Rounding to significant figures

The `significant` function returns a string representing the input number, rounded to a given number of significant figures, 3 by default.

In [19]:
numbers = [32352, 11.543647, 1/3, -11.222, 5]

for number in numbers:
    print('Before:', number, 'After:', misctools.significant(number, 4))

Before: 32352 After: 32350
Before: 11.543647 After: 11.54
Before: 0.3333333333333333 After: 0.333
Before: -11.222 After: -11.22
Before: 5 After: 5.000
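
For reference, one way to round to significant figures in plain Python is sketched below. The function `round_significant` is my own illustration and not the implementation of `misctools.significant`, so its output may differ from the library's for some inputs.

```python
import math


def round_significant(number, digits=3):
    """Return a string with `number` rounded to `digits` significant figures."""
    if number == 0:
        return "0"
    # Scientific notation rounds the mantissa to the requested number of figures.
    rounded = float(f"{number:.{digits - 1}e}")
    magnitude = math.floor(math.log10(abs(rounded)))
    decimals = digits - 1 - magnitude
    if decimals > 0:
        return f"{rounded:.{decimals}f}"
    # No decimals needed: the value is already rounded to a whole number.
    return str(int(rounded))


for value in [32352, 11.543647, -11.222, 5]:
    print("Before:", value, "After:", round_significant(value, 4))
```

The magnitude is computed after rounding, so a value like 9.9999 comes out as `10.00` rather than gaining an extra digit.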


---
# Numpytools

A few of the tools require Numpy to work - these are in a separate file, numpytools.

One function, `byte_iterfasta`, is a third fasta reader. It returns the sequences
as `numpy.uint8` byte arrays, which is useful for direct manipulation with functions like `numpytools.tnf`.

To use this, you must open the fasta file in binary reading mode:

---

In [21]:
with open('test/test.fasta', 'rb') as file:
    entries = numpytools.byte_iterfasta(file)
    header, array = next(entries)

print(header)
array # 65, 67, 71, 84 are the ASCII bytes for A, C, G, T

b'Q5ZMT0 Eukaryota;Metazoa;Chordata;Craniata;Vertebrata;Euteleostomi;Archelosauria;Archosauria;Dinosauria;Saurischia;Theropoda;Coelurosauria;Aves;Neognathae;Galloanserae;Galliformes;Phasianidae;Phasianinae;Gallus'


array([65, 84, 67, 71, 71, 84, 67, 67, 65, 84, 67, 71, 84, 71, 65, 84, 84,
       65, 84, 67, 71, 71, 67, 67, 71, 65, 71, 65, 71, 65, 67, 65, 65, 67,
       67, 84, 71, 65, 65, 65, 65, 67, 67, 71, 65, 67, 67, 84, 67, 71, 84,
       65, 84, 65, 71, 67, 65, 67, 84, 84, 71, 67, 67, 71, 71, 84, 65, 71,
       65, 84, 67, 65, 67, 65, 65, 84, 67, 67, 65, 71, 71, 71, 67, 67, 65,
       67, 84, 65, 65, 67, 84, 65, 65, 84, 71, 65, 67, 84, 67, 84, 65, 65,
       65, 65, 67, 84, 84, 65, 71, 65, 67, 65, 65, 71, 84, 71, 71, 84, 65,
       84, 67, 65, 67, 65, 84, 65, 67, 65, 84, 65, 65, 84, 71, 71, 71, 84,
       65, 84, 84, 67, 84, 71, 71, 84, 84, 65, 84, 65, 65, 65, 84, 71, 71,
       65, 71, 67, 84, 67, 67, 67, 71, 84, 71, 67, 65, 84, 71, 67, 65, 84,
       71, 67, 71, 84, 67, 71, 84, 71, 84, 65, 67, 65, 65, 84, 84, 65, 84,
       65, 65, 71, 84, 71, 67, 71, 84, 67, 67, 67, 67, 67, 65, 65, 71, 65,
       65, 67, 67, 71, 71, 71, 71, 65, 71, 71, 65, 67, 84, 67, 67, 84, 84,
       71, 65, 67, 65, 84

---
## Tetranucleotide frequencies

Using a Numpy array given by `byte_iterfasta`, you can count the tetranucleotide frequencies. There are two functions for that: `tnf` for the raw tetranucleotide frequencies, and `markov_normalized_tnf` for the Markov-normalized ones. Both functions return a length-136 array representing the frequencies of all tetranucleotides in order. The length is 136, not 256, because a tetranucleotide and its reverse complement are counted together, e.g. AAAA is the same as TTTT. Of the 256 tetranucleotides, 16 are their own reverse complement, which leaves (256 + 16) / 2 = 136 distinct ones.

---

In [22]:
numpytools.tnf(array)

array([ 0.00766284,  0.00766284,  0.        ,  0.00383142,  0.00383142,
        0.01532567,  0.        ,  0.00766284,  0.00383142,  0.        ,
        0.00383142,  0.01532567,  0.00383142,  0.00766284,  0.01915709,
        0.00383142,  0.01532567,  0.00383142,  0.00383142,  0.01149425,
        0.00766284,  0.00383142,  0.01532567,  0.00766284,  0.01149425,
        0.00766284,  0.00383142,  0.        ,  0.00766284,  0.00766284,
        0.00383142,  0.01149425,  0.00766284,  0.00766284,  0.00383142,
        0.00383142,  0.        ,  0.        ,  0.00383142,  0.01532567,
        0.00383142,  0.00383142,  0.        ,  0.01149425,  0.01532567,
        0.02681992,  0.01532567,  0.00383142,  0.        ,  0.01915709,
        0.00766284,  0.01532567,  0.01149425,  0.01532567,  0.01149425,
        0.01915709,  0.00766284,  0.00766284,  0.        ,  0.00766284,
        0.01532567,  0.00766284,  0.        ,  0.01149425,  0.00383142,
        0.        ,  0.01149425,  0.00766284,  0.00766284,  0.00
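
To see where the number 136 comes from, here is a pure-Python sketch that counts tetranucleotides after merging each one with its reverse complement. The ordering used here (alphabetical by the canonical tetranucleotide) is an assumption of mine and need not match the order used by `numpytools.tnf`. The same goes for skipping windows containing non-ACGT characters.

```python
from collections import Counter
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Return the alphabetically smaller of a k-mer and its reverse complement."""
    return min(kmer, kmer.translate(COMPLEMENT)[::-1])


# All distinct tetranucleotides once reverse complements are merged.
CANONICAL_TETRAMERS = sorted(
    {canonical("".join(bases)) for bases in product("ACGT", repeat=4)}
)


def tetranucleotide_frequencies(sequence):
    """Return the frequencies of the canonical tetranucleotides in `sequence`."""
    sequence = sequence.upper()
    counts = Counter()
    for i in range(len(sequence) - 3):
        kmer = sequence[i:i + 4]
        if set(kmer) <= set("ACGT"):  # skip windows with ambiguous bases
            counts[canonical(kmer)] += 1
    total = sum(counts.values())
    if total == 0:
        return [0.0] * len(CANONICAL_TETRAMERS)
    return [counts[kmer] / total for kmer in CANONICAL_TETRAMERS]


frequencies = tetranucleotide_frequencies("AAAATTTT")
print(len(CANONICAL_TETRAMERS), frequencies[0])
```

In the toy sequence, AAAA and TTTT fall into the same bin, so that bin accounts for two of the five tetranucleotides.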

---
## Assembly statistics

To make it as quick as possible, the `assemblystats` function works on files listing contig lengths rather than on the assemblies themselves.
On UNIX systems, you can make these files with the following commands:

    For SPAdes: grep '^>' contigs.fasta | cut -d_ -f4 | sort -nr | uniq -c > myfile.txt
    For MEGAHIT: grep '^>' final.contigs.fa | cut -d= -f4 | sort -nr | uniq -c > myfile.txt
    
`assemblystats` takes three arguments. The first is the input filename. The next two,
`xmax` and `step`, control the `sizes` attribute, that is, the assembly sizes at different contig size cutoffs.
For example, to see assembly sizes at cutoffs from 0 to 20 kbp with a step of 2500 bp, do:

---

In [24]:
step = 2500
xmax = 20000
assemblystats = numpytools.assemblystats('test/assemblyinput', xmax=xmax, step=step)

print('Number of contigs:', assemblystats.ncontigs)
print('Smallest contig:', assemblystats.smallest)
print('Largest contig:', assemblystats.largest)
print('N50:', assemblystats.n50)
print('Assembly size:', assemblystats.size)
print('Sizes:')
for n, size in enumerate(assemblystats.sizes):
    print('\t >=', n*step, 'bp:', size)

Number of contigs: 975492
Smallest contig: 250
Largest contig: 112828
N50: 1552
Assembly size: 891411678
Sizes:
	 >= 0 bp: 891411678
	 >= 2500 bp: 343976349
	 >= 5000 bp: 213200795
	 >= 7500 bp: 145077316
	 >= 10000 bp: 104495922
	 >= 12500 bp: 75206234
	 >= 15000 bp: 55686814
	 >= 17500 bp: 42019796
	 >= 20000 bp: 32483707
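
If you want to check numbers like these by hand, the sketch below computes N50 and the sizes at each cutoff from lines in the `count length` format produced by `uniq -c`. The helper names are hypothetical, and this is only my reading of what the statistics mean, not the code in `numpytools`.

```python
def parse_length_counts(lines):
    """Parse 'count length' lines into a list of (length, count) tuples."""
    pairs = []
    for line in lines:
        fields = line.split()
        if fields:
            pairs.append((int(fields[1]), int(fields[0])))
    return pairs


def n50(pairs):
    """Largest length L such that contigs >= L hold at least half of all bases."""
    total = sum(length * count for length, count in pairs)
    running = 0
    for length, count in sorted(pairs, reverse=True):
        running += length * count
        if 2 * running >= total:
            return length
    return 0


def sizes_at_cutoffs(pairs, xmax, step):
    """Total bases in contigs at least as long as each cutoff 0, step, ..., xmax."""
    return [
        sum(length * count for length, count in pairs if length >= cutoff)
        for cutoff in range(0, xmax + 1, step)
    ]


toy_lines = ["      2 8", "      1 4", "      2 3", "      3 2"]
toy_pairs = parse_length_counts(toy_lines)
toy_n50 = n50(toy_pairs)
toy_sizes = sizes_at_cutoffs(toy_pairs, xmax=8, step=4)
print(toy_n50, toy_sizes)
```

In the toy data, the two contigs of length 8 hold 16 of the 32 bases, so N50 is 8.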


In [25]:
# Clean up
for directory, __, files in os.walk('outputs/'):
    for file in files:
        os.remove(os.path.join(directory, file))
os.rmdir('outputs/')