Determining `exp_cov` for velvet from the coverage distribution of the contigs
============================================
How to use this notebook:

* 'activate' cells by clicking on them with the mouse (you will see a blinking cursor)
* execute cells by pressing the ctrl and enter keys simultaneously
* you can also execute code by pressing shift + enter; this moves the cursor to the next cell

The first cell imports some modules and prepares the notebook

In [None]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
%matplotlib inline

This code uses the Python *pandas* package to read the stats.txt file into a so-called pandas dataframe

In [None]:
stats = pd.read_table("stats.txt")

Let's look at the first lines of the dataframe (i.e., the imported data). The different columns are:

* ID: name of the node, corresponding to NODE\_ in the contigs.fa file
* lgth: length of the node (sequence), counted in k-mers (BUT see the velvet manual)
* out and in: number of connections to other nodes
* long_cov, short1_cov, short1_Ocov, short2_cov, short2_Ocov: coverage of the different input read datasets. 'long' refers to Sanger reads, if used. For detail, see the velvet manual
* long_nb, short1_nb, short2_nb: your teacher has no idea...

In [None]:
stats.head()

We are interested in the short1_cov column, which contains the average k-mer coverage of each node. Let's use the pandas `describe` method to summarise it:

In [None]:
stats.short1_cov.describe()

Now we are going to plot the distribution of the short1_cov data:

In [None]:
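# np.histogram needs an integer number of bins, but short1_cov holds floats,
# so round the maximum coverage up to get roughly one bin per unit of coverage
y, binEdges = np.histogram(stats.short1_cov, bins=int(np.ceil(stats.short1_cov.max())))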
bincenters = 0.5*(binEdges[1:]+binEdges[:-1])
plt.plot(bincenters,y,'-')
plt.title("Coverage of nodes in the graph")
plt.xlabel("Coverage")
plt.ylabel("Frequency")
plt.xlim((0,50))
plt.ylim((0,100))
plt.show()

The *peak value* in this histogram can be used as a guide to the best value for velvet's `exp_cov` (expected coverage) parameter.
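Reading the peak off the plot by eye works fine, but it can also be estimated in code. The cell below is only a minimal sketch: the function name `estimate_peak_coverage` and the cut-offs `min_cov` and `max_cov` are assumptions made for this illustration, not part of velvet or pandas. The lower cut-off is there to skip the pile of very-low-coverage nodes that usually sits at the left edge of the plot; choose it by looking at your own histogram. The demonstration data are synthetic, not real velvet output.

```python
import numpy as np

def estimate_peak_coverage(coverage, min_cov=5.0, max_cov=50.0):
    """Return the centre of the fullest 1x-wide coverage bin.

    min_cov and max_cov are illustrative cut-offs (assumptions): values
    outside [min_cov, max_cov] are ignored when searching for the peak.
    """
    cov = np.asarray(coverage, dtype=float)
    cov = cov[np.isfinite(cov)]                      # drop NaN/inf entries
    edges = np.arange(min_cov, max_cov + 1.0, 1.0)   # bin edges 1x apart
    counts, edges = np.histogram(cov, bins=edges)
    centers = 0.5 * (edges[1:] + edges[:-1])
    return centers[np.argmax(counts)]

# Synthetic demonstration data: many low-coverage nodes plus a real peak near 20x
rng = np.random.default_rng(0)
demo = np.concatenate([rng.exponential(2.0, 2000), rng.normal(20.0, 2.0, 1000)])
peak = estimate_peak_coverage(demo)
print(peak)
```

In this notebook the same function could be called as `estimate_peak_coverage(stats.short1_cov)`. Treat the result as a starting point and check it against the plot before passing it to velvet.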