In [1]:
from IPython import display
import pconbr.bench.count

# Code name: pconbr

Project target: Use a kmer counter to perform a pre-correction step on long-read data

## Dataset

### References

| code name       | species         | path                          | genome size |
|:----------------|:----------------|:------------------------------|------------:|
| s_pneumoniae    | S. pneumoniae   | reference/CP026549.fasta      |      2.2 Mb |
| c_vartiovaarae  | C. vartiovaarae |                               |    ~11.2 Mb |
| e_coli_ont      | E. coli         | reference/CP028309.fasta      |      4.7 Mb |
| e_coli_pb       | E. coli         | reference/CP028309.fasta      |      4.7 Mb |
| s_cerevisiae    | S. cerevisiae   | reference/GCA_002163515.fasta |     12.4 Mb |


### Reads
| code name       | species         | path                        | # bases (Gb)| coverage |
|:----------------|:----------------|:----------------------------|------------:|---------:|
| s_pneumoniae    | S. pneumoniae   | reads/SRR8556426.fasta      |         2.2 |   ~1000x |
| c_vartiovaarae  | C. vartiovaarae | reads/ERR18779[66-70].fasta |         1.7 |    ~150x |
| e_coli_ont      | E. coli         | reads/SRR8494940.fasta      |         1.6 |    ~340x |
| e_coli_pb       | E. coli         | reads/SRR8494911.fasta      |         1.4 |    ~297x |
| s_cerevisiae    | S. cerevisiae   | reads/SRR2157264_[1-2]      |       0.187 |     ~15x |



In [2]:
# To download the reference genomes, uncomment the next line and run this cell (this can take a long time)
#!./script/dl_ref.sh

In [3]:
# To download the reads, uncomment the next line and run this cell (this can take a long time)
#!./script/dl_reads.sh

## Kmer counting

In [4]:
# To run the pcon, KMC and Jellyfish counts on the datasets, uncomment the next line and run this cell
#!snakemake -s pipeline/count.snakefile all

Each file `benchmark/{counter name}/{dataset codename}.tsv` contains the run time (in seconds) and memory usage (in MB) of one run. The tables below summarize this information.
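As an illustration of how such per-run files can be collected into one table, here is a minimal sketch. It is not the code behind `pconbr.bench.count.get`. It assumes the files follow the Snakemake benchmark layout (a header row with an `s` column for seconds and a `max_rss` column for peak memory), and the helper names `read_benchmark` and `to_markdown` are hypothetical.

```python
import csv
import io
from collections import defaultdict


def read_benchmark(handle):
    """Return (seconds, max_rss) from the first data row of one benchmark TSV.

    Assumption: Snakemake-style header with "s" and "max_rss" columns.
    """
    row = next(csv.DictReader(handle, delimiter="\t"))
    return float(row["s"]), float(row["max_rss"])


def to_markdown(records, metric, counters=("jellyfish", "kmc", "pconbr")):
    """Build a markdown table from (counter, dataset, k, seconds, rss) tuples.

    metric is "time" or "memory"; runs that are absent are rendered as NA.
    """
    index = 0 if metric == "time" else 1
    table = defaultdict(dict)
    for counter, dataset, k, seconds, rss in records:
        table[(dataset, k)][counter] = (seconds, rss)[index]

    header = "| dataset | k | " + " | ".join(c.capitalize() for c in counters) + " |"
    lines = [header, "|:-|:-|" + "-:|" * len(counters)]
    for (dataset, k), values in sorted(table.items()):
        cells = " | ".join(
            f"{values[c]:.4f}" if c in values else "NA" for c in counters
        )
        lines.append(f"| {dataset} | k{k} | {cells} |")
    return "\n".join(lines)


# Usage with an in-memory file standing in for one benchmark TSV.
sample = "s\th:m:s\tmax_rss\n49.9236\t0:00:49\t21.78\n"
seconds, rss = read_benchmark(io.StringIO(sample))
print(to_markdown([("pconbr", "c_vartiovaarae", 13, seconds, rss)], "time"))
```

In the notebook, the returned string could be passed to `display.Markdown` in the same way as the cells below.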

In [12]:
display.Markdown(pconbr.bench.count.get("time"))

| dataset | k | Jellyfish | Kmc | Pconbr |
|:-|:-|-:|-:|-:|
| c_vartiovaarae | k13 | 334.2362 | 169.5719 | 49.9236 |
| c_vartiovaarae | k15 | 578.6363 | 627.1587 | 63.3266 |
| c_vartiovaarae | k17 | 965.6645 | 680.0812 | 100.9235 |
| e_coli_ont | k13 | 313.0706 | 158.7293 | 46.6151 |
| e_coli_ont | k15 | 891.7445 | 573.1665 | 59.2280 |
| e_coli_ont | k17 | 844.3351 | 629.6508 | 98.9184 |
| e_coli_pb | k13 | 286.7312 | 140.0799 | 42.6363 |
| e_coli_pb | k15 | 1683.8153 | 573.6715 | 50.7555 |
| e_coli_pb | k17 | 988.2953 | 622.1334 | 92.7290 |
| s_cerevisiae | k13 | 49.5876 | 22.2911 | 5.9670 |
| s_cerevisiae | k15 | 111.7834 | 94.5685 | 10.0484 |
| s_cerevisiae | k17 | 243.1594 | 96.3485 | 48.3371 |
| s_pneumoniae | k13 | 0 | 0 | 62.7292 |
| s_pneumoniae | k17 | 888.0235 | 0 | 0 |


In [13]:
display.Markdown(pconbr.bench.count.get("memory"))

| dataset | k | Jellyfish | Kmc | Pconbr |
|:-|:-|-:|-:|-:|
| c_vartiovaarae | k13 | 1578.83 | 2339.24 | 21.78 |
| c_vartiovaarae | k15 | 6203.79 | 10812.85 | 262.18 |
| c_vartiovaarae | k17 | 16391.07 | 11366.94 | 4102.84 |
| e_coli_ont | k13 | 1386.92 | 2223.92 | 22.01 |
| e_coli_ont | k15 | 22121.48 | 10784.25 | 262.40 |
| e_coli_ont | k17 | 16390.98 | 11018.08 | 4103.41 |
| e_coli_pb | k13 | 1992.64 | 2036.66 | 21.71 |
| e_coli_pb | k15 | 31598.27 | 10813.38 | 262.23 |
| e_coli_pb | k17 | 16390.88 | 11118.81 | 4103.17 |
| s_cerevisiae | k13 | 257.45 | 650.79 | 21.77 |
| s_cerevisiae | k15 | 1957.68 | 1716.51 | 262.23 |
| s_cerevisiae | k17 | 16390.80 | 1656.98 | 4101.55 |
| s_pneumoniae | k13 | 0 | 0 | 21.72 |
| s_pneumoniae | k17 | 16391.12 | 0 | 0 |
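The Pconbr column is almost independent of the dataset and grows by roughly 16x for every two extra bases of k. One reading consistent with this pattern, which is an assumption and is not stated in this notebook, is a dense table with one small fixed-width counter per canonical k-mer slot. The sketch below computes the size of such a table; the slot count `2**(2*k - 1)` and the 4-bit counter width are hypothetical parameters chosen for illustration.

```python
def dense_table_mib(k, bits_per_counter=4):
    """Size in MiB of a dense table holding one counter per k-mer slot.

    Assumptions (not taken from the notebook): 2**(2*k - 1) slots and
    4 bits per counter.
    """
    slots = 2 ** (2 * k - 1)
    return slots * bits_per_counter / 8 / 2 ** 20


for k in (13, 15, 17):
    print(f"k{k}: {dense_table_mib(k):.0f} MiB")
```

Under these assumptions the table would take 16, 256 and 4096 MiB for k13, k15 and k17, each a few MiB below the values measured for Pconbr.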


## PconBr parameter exploration

### Simulated dataset

Reads were simulated with [Badread](https://github.com/rrwick/Badread) from the E. coli CFT073 genome ([ENA id CP028309](https://www.ebi.ac.uk/ena/data/view/CP028309)).

We evaluate read identity before the pconbr pipeline with different values of k and s.
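The notebook does not define k and s at this point. Assuming k is the k-mer length and s is a minimum abundance ("solidity") threshold, the sketch below shows how the two parameters could interact: k-mers are counted in canonical form, and those seen at least s times are kept as solid. The function names are hypothetical, and this is an illustration in plain Python, not the pconbr implementation.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def count_kmers(reads, k):
    """Count canonical k-mers over all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[canonical(read[i:i + k])] += 1
    return counts


def solid_kmers(counts, s):
    """Keep k-mers whose count is at least s (assumed meaning of s)."""
    return {kmer for kmer, n in counts.items() if n >= s}


# Two identical reads and one read carrying a substitution (A -> T at position 5).
reads = ["ACGTACGA", "ACGTACGA", "ACGTTCGA"]
counts = count_kmers(reads, k=5)
solid = solid_kmers(counts, s=2)
print(sorted(solid))
```

In this toy example the k-mers covering the substitution occur only once and fall below the threshold, which is the kind of signal a k-mer based pre-correction step can rely on.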

In [7]:
# Run the snakemake pipelines that test parameters on each dataset
#!snakemake -s pipeline/parameter_exploration.snakefile genomic_kmer
#!snakemake -s pipeline/parameter_exploration.snakefile read_kmer
#!snakemake -s pipeline/parameter_exploration.snakefile bacteria
#!snakemake -s pipeline/parameter_exploration.snakefile yeast

### Effect of k and s on synthetic dataset

#### With genomic kmer

#### With noisy read kmer

#### On real bacteria dataset

#### On real yeast dataset


## Long read correction

To evaluate our correction against other tools, we use:
- [ELECTOR](//doi.org/10.1101/512889) to compare corrected reads against the reference genome
- [QUAST](//doi.org/10.1093/bioinformatics/bty266) to assess assembly results (redbean, rala, flye)

### Self correction

We compare pconbr against other self-correction tools.

| Tool name  | Reference                                                                |
|:-----------|:-------------------------------------------------------------------------|
| CONSENT    | [10.1101/546630](//doi.org/10.1101/546630)                               |
| daccord    | [10.1101/106252](//doi.org/10.1101/106252)                               |
| FLAS       | [10.1093/bioinformatics/btz206](//doi.org/10.1093/bioinformatics/btz206) |
| MECAT      | [10.1038/nmeth.4432](//doi.org/10.1038/nmeth.4432)                       |

#### Mapping result

In [8]:
display.Markdown("TODO")

TODO

#### Assembly result

In [9]:
display.Markdown("TODO")

TODO

### Hybrid correction

We compare pconbr against other hybrid correction tools.

| Tool name  | Reference                                                                |
|:-----------|:-------------------------------------------------------------------------|


#### Mapping result

In [10]:
display.Markdown("TODO")

TODO

#### Assembly result

In [11]:
display.Markdown("TODO")

TODO

## Polishing

