In [1]:
from IPython import display
import pconbr
import pconbr.bench
import pconbr.bench.count
import pconbr.identity
import pconbr.kmer_count
import pconbr.kmer_count.curve

# Code name: pconbr

Project target: Use a kmer counter to perform a pre-correction step on long-read data

## Dataset

### References

| code name       | species         | path                           | genome size |
|:----------------|:----------------|:-------------------------------|------------:|
| s_pneumoniae    | S. pneumoniae   | references/CP026549.fasta      |      2.2 Mb |
| c_vartiovaarae  | C. vartiovaarae |                                |    ~11.2 Mb |
| e_coli_ont      | E. coli         | references/CP028309.fasta      |      4.7 Mb |
| e_coli_pb       | E. coli         | references/CP028309.fasta      |      4.7 Mb |
| s_cerevisiae    | S. cerevisiae   | references/GCA_002163515.fasta |     12.4 Mb |


### Reads
| code name       | species         | path                        | # bases (Gb)| coverage |
|:----------------|:----------------|:----------------------------|------------:|---------:|
| s_pneumoniae    | S. pneumoniae   | reads/SRR8556426.fasta      |         2.2 |   ~1000x |
| c_vartiovaarae  | C. vartiovaarae | reads/ERR18779[66-70].fasta |         1.7 |    ~150x |
| e_coli_ont      | E. coli         | reads/SRR8494940.fasta      |         1.6 |    ~340x |
| e_coli_pb       | E. coli         | reads/SRR8494911.fasta      |         1.4 |    ~297x |
| s_cerevisiae    | S. cerevisiae   | reads/SRR2157264_[1-2]      |       0.187 |     ~15x |
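
The coverage column can be rechecked from the two tables above as total read bases divided by genome size. The sketch below only redoes that arithmetic; the function name `coverage` and the `datasets` dictionary are illustrative and are not part of pconbr.

```python
def coverage(read_bases: float, genome_size: float) -> float:
    """Mean sequencing depth: total read bases divided by genome length."""
    return read_bases / genome_size


# Values copied from the References and Reads tables (bases, genome size in bp).
datasets = {
    "s_pneumoniae": (2.2e9, 2.2e6),
    "c_vartiovaarae": (1.7e9, 11.2e6),
    "e_coli_ont": (1.6e9, 4.7e6),
    "e_coli_pb": (1.4e9, 4.7e6),
    "s_cerevisiae": (0.187e9, 12.4e6),
}

for name, (bases, genome) in datasets.items():
    print(f"{name}: ~{coverage(bases, genome):.0f}x")
```

The results agree with the coverage column up to the rounding of the base counts and genome sizes listed in the tables.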



In [2]:
# To download the reference genomes, uncomment the next line and run this cell (this can take a long time)
#!./script/dl_ref.sh

In [3]:
# To download the reads, uncomment the next line and run this cell (this can take a long time)
#!./script/dl_reads.sh

## Kmer counting

In [4]:
# To run the pcon, KMC and Jellyfish counts on the datasets, uncomment the next line and run this cell
#!snakemake -s pipeline/count.snakefile all

Each file `benchmark/{counter name}/{dataset codename}.tsv` contains the run time (in seconds) and memory usage (in MB) of one run. The two tables below summarize this information.
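
As a rough illustration of how such a summary could be built, the sketch below reads a Snakemake benchmark file and renders a Markdown table. It is not the code behind `pconbr.bench.count.get`, which is not shown here. The helper names `read_benchmark` and `markdown_table` and the layout of `records` are hypothetical. The column names `s` (wall-clock seconds) and `max_rss` (peak resident memory) are those written by Snakemake's `benchmark` directive.

```python
import csv
import io


def read_benchmark(handle):
    """Return the first run of a Snakemake benchmark TSV as a dict of strings."""
    return next(csv.DictReader(handle, delimiter="\t"))


def markdown_table(records, column):
    """Render {(dataset, k): {counter: benchmark_row}} as a Markdown table."""
    counters = sorted({name for runs in records.values() for name in runs})
    lines = [
        "| dataset | k | " + " | ".join(counters) + " |",
        "|:-|:-|" + "-:|" * len(counters),
    ]
    for (dataset, k), runs in sorted(records.items()):
        cells = [
            f"{float(runs[c][column]):.4f}" if c in runs else ""
            for c in counters
        ]
        lines.append(f"| {dataset} | k{k} | " + " | ".join(cells) + " |")
    return "\n".join(lines)


# In-memory stand-in for one benchmark file; a real run would use open(path).
example = io.StringIO("s\tmax_rss\n53.1597\t22.00\n")
records = {("c_vartiovaarae", 13): {"pconbr": read_benchmark(example)}}
print(markdown_table(records, "s"))
```

Passing `"max_rss"` instead of `"s"` would give the memory view of the same records.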

In [5]:
display.Markdown(pconbr.bench.count.get("time"))

| dataset | k | Jellyfish | Kmc | Pconbr |
|:-|:-|-:|-:|-:|
| c_vartiovaarae | k13 | 413.5882 | 193.6503 | 53.1597 |
| c_vartiovaarae | k15 | 766.9468 | 784.6563 | 63.4619 |
| c_vartiovaarae | k17 | 1288.2844 | 833.9791 | 252.0244 |
| e_coli_ont | k13 | 403.0423 | 184.5209 | 49.9495 |
| e_coli_ont | k15 | 1135.5411 | 724.3784 | 61.9082 |
| e_coli_ont | k17 | 1166.9966 | 780.5869 | 214.6334 |
| e_coli_pb | k13 | 361.0433 | 157.6295 | 44.4903 |
| e_coli_pb | k15 | 1456.8800 | 732.9156 | 54.2674 |
| e_coli_pb | k17 | 1255.0258 | 766.6578 | 215.9037 |
| s_cerevisiae | k13 | 56.8721 | 24.2400 | 6.6724 |
| s_cerevisiae | k15 | 123.3701 | 130.9554 | 9.8227 |
| s_cerevisiae | k17 | 290.8972 | 133.2776 | 63.6876 |
| s_pneumoniae | k13 | 540.4675 | 265.6703 | 66.9792 |
| s_pneumoniae | k15 | 905.0415 | 870.6088 | 84.3198 |
| s_pneumoniae | k17 | 1301.9399 | 939.4879 | 284.0073 |


In [6]:
display.Markdown(pconbr.bench.count.get("memory"))

| dataset | k | Jellyfish | Kmc | Pconbr |
|:-|:-|-:|-:|-:|
| c_vartiovaarae | k13 | 1581.97 | 2143.56 | 22.00 |
| c_vartiovaarae | k15 | 6204.13 | 10830.66 | 262.54 |
| c_vartiovaarae | k17 | 16391.34 | 11046.30 | 4101.01 |
| e_coli_ont | k13 | 1387.03 | 2219.09 | 22.03 |
| e_coli_ont | k15 | 22121.59 | 10636.99 | 262.75 |
| e_coli_ont | k17 | 16391.30 | 10898.98 | 4103.75 |
| e_coli_pb | k13 | 1993.30 | 2034.21 | 21.91 |
| e_coli_pb | k15 | 35264.41 | 10717.47 | 262.36 |
| e_coli_pb | k17 | 16391.14 | 11079.68 | 4103.34 |
| s_cerevisiae | k13 | 257.59 | 657.64 | 22.02 |
| s_cerevisiae | k15 | 1957.95 | 1442.41 | 262.57 |
| s_cerevisiae | k17 | 16391.10 | 1367.47 | 4092.84 |
| s_pneumoniae | k13 | 1783.90 | 2810.83 | 21.98 |
| s_pneumoniae | k15 | 11002.25 | 11242.54 | 262.57 |
| s_pneumoniae | k17 | 16391.33 | 11273.15 | 4103.57 |


## PconBr parameter exploration

### Simulated dataset

The error rate is taken from the `error rate:` line of the `samtools stats` output.
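
A minimal sketch of how that line can be read, assuming the usual tab-separated summary format of `samtools stats` (`SN`, key, value, optional comment). The function name `error_rate` and the example text are illustrative. The example value assumes that the 5.625682 quoted for the simulated reads is a percentage.

```python
def error_rate(stats_text: str) -> float:
    """Extract the `error rate:` value from `samtools stats` output."""
    for line in stats_text.splitlines():
        fields = line.split("\t")
        if len(fields) >= 3 and fields[0] == "SN" and fields[1] == "error rate:":
            return float(fields[2])
    raise ValueError("no 'error rate:' line found")


# Illustrative excerpt only; real output contains many more summary lines.
example = (
    "SN\tsequences:\t1000\n"
    "SN\terror rate:\t5.625682e-02\t# mismatches / bases mapped (cigar)\n"
)
print(error_rate(example) * 100)  # as a percentage
```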

Reads were simulated with [Badread](https://github.com/rrwick/Badread) on the E. coli CFT073 genome ([ENA id CP028309](https://www.ebi.ac.uk/ena/data/view/CP028309)); their error rate is 5.625682.

We evaluate how the pconbr pipeline changes read identity for different values of the kmer size (k), the number of kmers required to validate a kmer (s), and the minimal abundance of a solid kmer (a).
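
To make the roles of k and a concrete, here is a minimal sketch of kmer counting with an abundance threshold. It is not the pconbr implementation: counting canonical kmers (the lexicographic minimum of a kmer and its reverse complement) and treating a kmer as solid when its count is at least a are assumptions. The validation parameter s is left out because its exact semantics are not described here.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer: str) -> str:
    """Return the smaller of a kmer and its reverse complement."""
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def count_kmers(reads, k):
    """Count canonical kmers of size k over all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[canonical(read[i:i + k])] += 1
    return counts


def solid_kmers(counts, a):
    """Keep the kmers seen at least `a` times (assumed meaning of 'solid')."""
    return {kmer for kmer, n in counts.items() if n >= a}


reads = ["ACGTAC", "ACGTAA", "CGTACG"]  # toy reads, not real data
counts = count_kmers(reads, k=3)
print(sorted(solid_kmers(counts, a=2)))
```

With a = 1 every observed kmer is solid under this definition, so there is nothing to correct.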


In [7]:
# Run the snakemake pipelines that test the parameters on each dataset
#!snakemake -s pipeline/parameter_exploration.snakefile genomic_kmer
#!snakemake -s pipeline/parameter_exploration.snakefile read_kmer
#!snakemake -s pipeline/parameter_exploration.snakefile bacteria
#!snakemake -s pipeline/parameter_exploration.snakefile yeast

### Synthetic dataset

#### Kmer spectrum

In [23]:
#!snakemake -s pipeline/generate_stat.snakefile kmer_spectrum_simulated_reads
import pandas
import plotly.graph_objects as go

df = pandas.read_csv("stats/kmer_spectrum/simulated_reads.csv", index_col=0)

fig = go.Figure(data=[go.Bar(name=c, y=df[c]) for c in df.columns],
               layout=go.Layout(yaxis=dict(range=[0, 225_000]),
                                xaxis=dict(range=[0, 125])))

#fig.update_layout(yaxis_type="log")
fig.show()
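
For reference, a kmer spectrum is the histogram of kmer abundances: for each abundance, the number of distinct kmers seen that many times. The sketch below computes one from a count table. It is only an illustration; how `stats/kmer_spectrum/*.csv` is actually produced is not shown in this notebook, and capping abundances at 255 is an assumption.

```python
from collections import Counter


def kmer_spectrum(counts, max_abundance=255):
    """Map each abundance to the number of distinct kmers seen that many times."""
    spectrum = Counter(min(n, max_abundance) for n in counts.values())
    return dict(sorted(spectrum.items()))


# Toy count table (kmer -> abundance), not real data.
counts = {"ACG": 6, "GTA": 5, "TAA": 1, "CCA": 1, "GGA": 300}
print(kmer_spectrum(counts))
```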

#### Correction with genomic kmer

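Difference between the corrected-read error rate and the original error rate. Some values exceed the original error rate of 5.625682, so the difference can only be read as corrected minus original: negative values mean fewer errors after correction.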

In [9]:
display.Markdown(pconbr.identity.genomic_kmer())

| | s1| s2| s3| s4| s5| s6| s7| s8| s9|
|:-|-:|-:|-:|-:|-:|-:|-:|-:|-:|
| k9 | 0.004181 | 0.003905 | 0.003752 | 0.003687 | 0.003685 | 0.003684 | 0.003684 | 0.003683 | 0.003683|
| k11 | 0.751112 | 0.604209 | 0.510506 | 0.449725 | 0.409512 | 0.381365 | 0.360422 | 0.343445 | 0.328332|
| k13 | 9.130946 | 2.757518 | 0.653570 | -0.121977 | -0.438984 | -0.571321 | -0.620935 | -0.632684 | -0.625700|
| k15 | -0.593920 | -1.673538 | -1.757720 | -1.682130 | -1.575601 | -1.470702 | -1.374537 | -1.286678 | -1.206559|
| k17 | -2.266908 | -2.115433 | -1.940693 | -1.786172 | -1.648175 | -1.527566 | -1.421966 | -1.327155 | -1.241321|


#### Correction with reads kmer

Difference between the corrected-read error rate and the original error rate (negative values mean fewer errors after correction)

In [10]:
display.Markdown(pconbr.identity.read_kmer("simulated_reads"))

| | | s1| s2| s3| s4| s5| s6| s7| s8| s9|
|:-|:-|-:|-:|-:|-:|-:|-:|-:|-:|-:|
| k13 | a1 | 0.000037 | 0.000034 | 0.000035 | 0.000033 | 0.000029 | 0.000026 | 0.000021 | 0.000019 | 0.000016|
| k13 | a2 | 0.215460 | 0.168705 | 0.139871 | 0.121776 | 0.110133 | 0.102619 | 0.097446 | 0.093889 | 0.091342|
| k13 | a3 | 0.614623 | 0.456799 | 0.362757 | 0.305179 | 0.268439 | 0.244298 | 0.227834 | 0.216180 | 0.207288|
| k13 | a4 | 1.086319 | 0.766444 | 0.584145 | 0.475263 | 0.406404 | 0.361045 | 0.330015 | 0.307569 | 0.289928|
| k13 | a5 | 1.570259 | 1.051163 | 0.767033 | 0.601419 | 0.498478 | 0.431347 | 0.384782 | 0.351024 | 0.324560|
| k13 | a6 | 2.039368 | 1.292349 | 0.900345 | 0.678051 | 0.542942 | 0.455057 | 0.395071 | 0.351225 | 0.316679|
| k13 | a7 | 2.493715 | 1.492351 | 0.988531 | 0.712274 | 0.547615 | 0.442231 | 0.371084 | 0.319390 | 0.279070|
| k13 | a8 | 2.931029 | 1.654744 | 1.039043 | 0.712068 | 0.522047 | 0.402326 | 0.322633 | 0.265483 | 0.221121|
| k13 | a9 | 3.365345 | 1.786099 | 1.058886 | 0.686901 | 0.475220 | 0.345196 | 0.260191 | 0.199787 | 0.153539|
| k13 | a10 | 3.786424 | 1.891821 | 1.057285 | 0.644004 | 0.415446 | 0.277800 | 0.189894 | 0.128377 | 0.081920|
| k13 | a11 | 4.195971 | 1.978799 | 1.039484 | 0.590793 | 0.348403 | 0.205787 | 0.116646 | 0.055099 | 0.009544|
| k13 | a12 | 4.600994 | 2.051393 | 1.013081 | 0.531695 | 0.278710 | 0.133441 | 0.044626 | -0.015609 | -0.059261|
| k13 | a13 | 4.991666 | 2.115767 | 0.983067 | 0.472319 | 0.211018 | 0.063774 | -0.024156 | -0.082127 | -0.123059|
| k13 | a14 | 5.360656 | 2.172184 | 0.950407 | 0.413952 | 0.145004 | -0.002902 | -0.088663 | -0.144207 | -0.181976|
| k13 | a15 | 5.719616 | 2.220840 | 0.916351 | 0.356448 | 0.082156 | -0.064890 | -0.148049 | -0.200659 | -0.235215|
| k15 | a1 | 0.000999 | 0.000966 | 0.000721 | 0.000506 | 0.000329 | 0.000211 | 0.000147 | 0.000095 | 0.000060|
| k15 | a2 | 4.767306 | 1.479424 | 0.474669 | 0.115645 | -0.032350 | -0.098497 | -0.129742 | -0.144150 | -0.150074|
| k15 | a3 | 3.074111 | 0.587080 | -0.120971 | -0.350492 | -0.426049 | -0.445933 | -0.444476 | -0.433762 | -0.419424|
| k15 | a4 | 1.755079 | -0.154884 | -0.631169 | -0.753906 | -0.770163 | -0.751742 | -0.722120 | -0.689216 | -0.656334|
| k15 | a5 | 0.917499 | -0.665431 | -0.995455 | -1.048267 | -1.024030 | -0.978618 | -0.928425 | -0.878980 | -0.832160|
| k15 | a6 | 0.388983 | -1.001107 | -1.240272 | -1.249489 | -1.198823 | -1.134920 | -1.070547 | -1.009472 | -0.952463|
| k15 | a7 | 0.057286 | -1.217387 | -1.400861 | -1.381880 | -1.314299 | -1.238041 | -1.164411 | -1.095437 | -1.031450|
| k15 | a8 | -0.156403 | -1.359172 | -1.507626 | -1.470442 | -1.391202 | -1.306780 | -1.226644 | -1.152212 | -1.083623|
| k15 | a9 | -0.292327 | -1.452179 | -1.579102 | -1.529730 | -1.442792 | -1.352583 | -1.268011 | -1.189968 | -1.118148|
| k15 | a10 | -0.381732 | -1.514243 | -1.627250 | -1.570054 | -1.477869 | -1.383759 | -1.296145 | -1.215516 | -1.141540|
| k15 | a11 | -0.441332 | -1.557878 | -1.661360 | -1.598764 | -1.502876 | -1.406110 | -1.316314 | -1.233804 | -1.158240|
| k15 | a12 | -0.483853 | -1.588196 | -1.685625 | -1.619293 | -1.520623 | -1.421847 | -1.330484 | -1.246668 | -1.169951|
| k15 | a13 | -0.514198 | -1.609975 | -1.703448 | -1.634636 | -1.533919 | -1.433628 | -1.341046 | -1.256221 | -1.178714|
| k15 | a14 | -0.536252 | -1.626250 | -1.716914 | -1.646170 | -1.543976 | -1.442566 | -1.349097 | -1.263562 | -1.185418|
| k15 | a15 | -0.551807 | -1.638319 | -1.727049 | -1.655017 | -1.551693 | -1.449462 | -1.355315 | -1.269200 | -1.190572|
| k17 | a1 | 0.000603 | 0.000214 | -0.000051 | -0.000154 | -0.000174 | -0.000159 | -0.000147 | -0.000133 | -0.000114|
| k17 | a2 | 0.070559 | -0.453851 | -0.535751 | -0.532813 | -0.509512 | -0.482872 | -0.457117 | -0.433140 | -0.411010|
| k17 | a3 | -0.799379 | -1.045109 | -1.034694 | -0.983366 | -0.925144 | -0.869476 | -0.818672 | -0.772072 | -0.729232|
| k17 | a4 | -1.377328 | -1.456232 | -1.382464 | -1.294939 | -1.209576 | -1.131404 | -1.060958 | -0.996817 | -0.938182|
| k17 | a5 | -1.727605 | -1.705216 | -1.591852 | -1.480734 | -1.377273 | -1.284316 | -1.201452 | -1.126168 | -1.057595|
| k17 | a6 | -1.930558 | -1.851733 | -1.714111 | -1.588179 | -1.473462 | -1.371428 | -1.280928 | -1.198975 | -1.124570|
| k17 | a7 | -2.049304 | -1.939192 | -1.787278 | -1.651645 | -1.529995 | -1.422316 | -1.327184 | -1.241273 | -1.163198|
| k17 | a8 | -2.120380 | -1.993135 | -1.832514 | -1.690852 | -1.564469 | -1.453229 | -1.355043 | -1.266620 | -1.186287|
| k17 | a9 | -2.164963 | -2.027858 | -1.862182 | -1.716407 | -1.586865 | -1.473144 | -1.373062 | -1.282894 | -1.201097|
| k17 | a10 | -2.194589 | -2.051885 | -1.882928 | -1.734459 | -1.602588 | -1.487188 | -1.385610 | -1.294257 | -1.211375|
| k17 | a11 | -2.215076 | -2.068864 | -1.897986 | -1.747648 | -1.614155 | -1.497342 | -1.394744 | -1.302456 | -1.218794|
| k17 | a12 | -2.229928 | -2.081389 | -1.909272 | -1.757677 | -1.622899 | -1.505110 | -1.401663 | -1.308747 | -1.224546|
| k17 | a13 | -2.240788 | -2.090959 | -1.917905 | -1.765411 | -1.629667 | -1.511081 | -1.407030 | -1.313606 | -1.228993|
| k17 | a14 | -2.248726 | -2.098086 | -1.924540 | -1.771409 | -1.634970 | -1.515758 | -1.411294 | -1.317469 | -1.232510|
| k17 | a15 | -2.254555 | -2.103471 | -1.929541 | -1.775932 | -1.638991 | -1.519344 | -1.414523 | -1.320415 | -1.235205|
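
A table of this shape can be scanned programmatically for its most negative cell. The sketch below parses the Markdown text and returns that cell, on the assumption that more negative values are better. The names `parse_table` and `most_negative` are illustrative, and the excerpt reuses two rows of the table above.

```python
def parse_table(markdown: str):
    """Yield (k, a, s, value) tuples from a table laid out like the one above."""
    rows = []
    for line in markdown.strip().splitlines():
        line = line.strip()
        if line.startswith("|"):
            line = line[1:]
        if line.endswith("|"):
            line = line[:-1]
        rows.append([cell.strip() for cell in line.split("|")])
    header, body = rows[0], rows[2:]  # rows[1] is the alignment row
    for k, a, *values in body:
        for s, value in zip(header[2:], values):
            yield k, a, s, float(value)


def most_negative(markdown: str):
    """Return the (k, a, s, value) tuple with the smallest value."""
    return min(parse_table(markdown), key=lambda row: row[3])


excerpt = """
| | | s1| s2|
|:-|:-|-:|-:|
| k15 | a15 | -0.551807 | -1.638319|
| k17 | a15 | -2.254555 | -2.103471|
"""
print(most_negative(excerpt))
```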


### E. coli ont dataset

#### Kmer spectrum

In [27]:
#!snakemake -s pipeline/generate_stat.snakefile kmer_spectrum_simulated_reads
import pandas
import plotly.graph_objects as go

df = pandas.read_csv("stats/kmer_spectrum/e_coli_ont.csv", index_col=0)

fig = go.Figure(data=[go.Bar(name=c, y=df[c]) for c in df.columns],
               layout=go.Layout(yaxis=dict(range=[0, 100_000]),
                                xaxis=dict(range=[0, 255])))
fig.show()

#### Correction with reads kmer

Difference between the corrected-read error rate and the original error rate (negative values mean fewer errors after correction)

In [12]:
display.Markdown(pconbr.identity.read_kmer("SRR8494940"))

| | | s1| s2| s3| s4| s5| s6| s7| s8| s9|
|:-|:-|-:|-:|-:|-:|-:|-:|-:|-:|-:|
| k13 | a1 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000|
| k13 | a2 | 0.002080 | 0.001930 | 0.001790 | 0.002060 | 0.002190 | 0.002170 | 0.002170 | 0.002170 | 0.002170|
| k13 | a3 | 0.007390 | 0.005960 | 0.006370 | 0.006500 | 0.006010 | 0.005770 | 0.005520 | 0.005580 | 0.005570|
| k13 | a4 | 0.022760 | 0.021010 | 0.016540 | 0.015990 | 0.016240 | 0.016080 | 0.016160 | 0.016100 | 0.016070|
| k13 | a5 | 0.040810 | 0.033690 | 0.031390 | 0.029940 | 0.028690 | 0.028010 | 0.027390 | 0.027480 | 0.026570|
| k13 | a6 | 0.064580 | 0.053350 | 0.049360 | 0.047040 | 0.046350 | 0.045950 | 0.043560 | 0.043670 | 0.044350|
| k13 | a7 | 0.093220 | 0.078590 | 0.069830 | 0.068580 | 0.067590 | 0.065270 | 0.065900 | 0.064790 | 0.064890|
| k13 | a8 | 0.132410 | 0.109700 | 0.096130 | 0.090450 | 0.086220 | 0.082460 | 0.080520 | 0.079270 | 0.078740|
| k13 | a9 | 0.169740 | 0.142450 | 0.124890 | 0.117010 | 0.111130 | 0.105900 | 0.101940 | 0.100750 | 0.100090|
| k13 | a10 | 0.226400 | 0.188750 | 0.169540 | 0.157380 | 0.148080 | 0.140190 | 0.137500 | 0.133460 | 0.132460|
| k13 | a11 | 0.266080 | 0.223700 | 0.200890 | 0.187150 | 0.176720 | 0.168060 | 0.161600 | 0.157670 | 0.156940|
| k13 | a12 | 0.317810 | 0.272910 | 0.238010 | 0.219020 | 0.204140 | 0.195990 | 0.189810 | 0.186000 | 0.184990|
| k13 | a13 | 0.374300 | 0.314450 | 0.272920 | 0.248890 | 0.234030 | 0.225020 | 0.218990 | 0.215380 | 0.212930|
| k13 | a14 | 0.434120 | 0.346160 | 0.304700 | 0.276210 | 0.261580 | 0.251870 | 0.244410 | 0.241080 | 0.238090|
| k13 | a15 | 0.502630 | 0.406740 | 0.347610 | 0.312360 | 0.294500 | 0.282740 | 0.272130 | 0.268380 | 0.264050|
| k15 | a1 | -0.000120 | -0.000090 | -0.000100 | -0.000100 | -0.000110 | -0.000110 | -0.000110 | -0.000120 | 0.000010|
| k15 | a2 | 1.423390 | 0.888880 | 0.577720 | 0.406670 | 0.313520 | 0.255960 | 0.215180 | 0.184910 | 0.161200|
| k15 | a3 | 2.421420 | 1.448110 | 0.908490 | 0.617190 | 0.455280 | 0.346270 | 0.283010 | 0.238800 | 0.203510|
| k15 | a4 | 3.046960 | 1.683210 | 1.006540 | 0.654910 | 0.451250 | 0.331200 | 0.258740 | 0.210900 | 0.168250|
| k15 | a5 | 3.330690 | 1.775760 | 1.004910 | 0.610000 | 0.402730 | 0.275060 | 0.192900 | 0.147750 | 0.111000|
| k15 | a6 | 3.484930 | 1.753000 | 0.913280 | 0.528390 | 0.322170 | 0.207120 | 0.127480 | 0.082390 | 0.057170|
| k15 | a7 | 3.485880 | 1.648830 | 0.823040 | 0.442340 | 0.249870 | 0.136740 | 0.068280 | 0.028370 | 0.000490|
| k15 | a8 | 3.452590 | 1.542000 | 0.724210 | 0.352350 | 0.170830 | 0.062920 | 0.013430 | -0.020800 | -0.043570|
| k15 | a9 | 3.363460 | 1.421180 | 0.606830 | 0.245390 | 0.088810 | -0.003640 | -0.047870 | -0.068580 | -0.086450|
| k15 | a10 | 3.245270 | 1.275790 | 0.501240 | 0.170040 | 0.018110 | -0.057940 | -0.090190 | -0.115330 | -0.130410|
| k15 | a11 | 3.108190 | 1.171710 | 0.410380 | 0.090240 | -0.043890 | -0.100470 | -0.137440 | -0.152640 | -0.164300|
| k15 | a12 | 2.977000 | 1.033030 | 0.301280 | 0.018380 | -0.100430 | -0.149790 | -0.179090 | -0.192030 | -0.193730|
| k15 | a13 | 2.855620 | 0.906530 | 0.219030 | -0.052160 | -0.157390 | -0.194190 | -0.220250 | -0.223460 | -0.220440|
| k15 | a14 | 2.715190 | 0.790090 | 0.132770 | -0.109780 | -0.204390 | -0.233290 | -0.246140 | -0.247270 | -0.243180|
| k15 | a15 | 2.601220 | 0.697330 | 0.064460 | -0.166750 | -0.247320 | -0.271650 | -0.280080 | -0.273840 | -0.268130|
| k17 | a1 | 0.000050 | 0.000000 | -0.000110 | -0.000120 | -0.000140 | -0.000160 | -0.000170 | -0.000240 | -0.000240|
| k17 | a2 | 1.956870 | 0.465040 | 0.048770 | -0.074490 | -0.114260 | -0.122720 | -0.121880 | -0.121830 | -0.118460|
| k17 | a3 | 1.018780 | 0.086340 | -0.143390 | -0.199680 | -0.211110 | -0.203190 | -0.191350 | -0.178470 | -0.168480|
| k17 | a4 | 0.554690 | -0.121230 | -0.252290 | -0.286540 | -0.271460 | -0.258180 | -0.242200 | -0.227570 | -0.213100|
| k17 | a5 | 0.216180 | -0.263960 | -0.342110 | -0.348950 | -0.327410 | -0.305160 | -0.285740 | -0.264320 | -0.248920|
| k17 | a6 | -0.036350 | -0.382830 | -0.413190 | -0.402350 | -0.370430 | -0.345760 | -0.319240 | -0.297690 | -0.276720|
| k17 | a7 | -0.245580 | -0.474910 | -0.470180 | -0.452330 | -0.416950 | -0.381590 | -0.355370 | -0.330780 | -0.307450|
| k17 | a8 | -0.426010 | -0.555840 | -0.515770 | -0.490020 | -0.451070 | -0.410990 | -0.381700 | -0.353930 | -0.328180|
| k17 | a9 | -0.564010 | -0.619060 | -0.559130 | -0.521820 | -0.477220 | -0.435800 | -0.399160 | -0.370730 | -0.346600|
| k17 | a10 | -0.713440 | -0.678480 | -0.598640 | -0.550740 | -0.503500 | -0.457610 | -0.419380 | -0.387600 | -0.358880|
| k17 | a11 | -0.826600 | -0.724850 | -0.631970 | -0.583520 | -0.535010 | -0.487150 | -0.448260 | -0.412140 | -0.377890|
| k17 | a12 | -0.935730 | -0.765310 | -0.661670 | -0.604670 | -0.552340 | -0.506050 | -0.461370 | -0.423870 | -0.389710|
| k17 | a13 | -1.037100 | -0.804080 | -0.688620 | -0.628780 | -0.573420 | -0.522610 | -0.472020 | -0.435730 | -0.398640|
| k17 | a14 | -1.126490 | -0.845210 | -0.717530 | -0.652290 | -0.594070 | -0.540200 | -0.486440 | -0.448230 | -0.409170|
| k17 | a15 | -1.205820 | -0.880030 | -0.744720 | -0.669260 | -0.607280 | -0.549450 | -0.495420 | -0.454610 | -0.416520|


### E. coli pb dataset

#### Kmer spectrum

In [28]:
#!snakemake -s pipeline/generate_stat.snakefile kmer_spectrum_simulated_reads
import pandas
import plotly.graph_objects as go

df = pandas.read_csv("stats/kmer_spectrum/e_coli_pb.csv", index_col=0)

# One bar trace per column of the spectrum table, with fixed axis ranges
fig = go.Figure(data=[go.Bar(name=c, y=df[c]) for c in df.columns],
               layout=go.Layout(yaxis=dict(range=[0, 800_000]),
                                xaxis=dict(range=[0, 255])))

fig.show()

#### Correction with reads kmer

In [14]:
display.Markdown(pconbr.identity.read_kmer("SRR8494911"))

| | | s1| s2| s3| s4| s5| s6| s7| s8| s9|
|:-|:-|-:|-:|-:|-:|-:|-:|-:|-:|-:|
| k13 | a1 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000|
| k13 | a2 | 0.000010 | 0.000010 | 0.000010 | 0.000010 | 0.000010 | 0.000010 | 0.000010 | 0.000010 | 0.000010|
| k13 | a3 | 0.000070 | 0.000070 | 0.000070 | 0.000070 | 0.000070 | 0.000070 | 0.000060 | 0.000060 | 0.000060|
| k13 | a4 | 0.000170 | 0.000140 | 0.000120 | 0.000110 | 0.000190 | 0.000180 | 0.000180 | 0.000180 | 0.000180|
| k13 | a5 | 0.001040 | 0.000960 | 0.000950 | 0.000840 | 0.000860 | 0.000850 | 0.000840 | 0.000840 | 0.000760|
| k13 | a6 | 0.002410 | 0.002150 | 0.002480 | 0.002310 | 0.002290 | 0.002140 | 0.002140 | 0.002140 | 0.002110|
| k13 | a7 | 0.007530 | 0.007510 | 0.007350 | 0.007250 | 0.007160 | 0.007130 | 0.007010 | 0.006960 | 0.006980|
| k13 | a8 | 0.012980 | 0.012880 | 0.012180 | 0.012550 | 0.011770 | 0.011620 | 0.011770 | 0.011840 | 0.011690|
| k13 | a9 | 0.024840 | 0.023170 | 0.021780 | 0.020860 | 0.020330 | 0.020080 | 0.019940 | 0.019740 | 0.019750|
| k13 | a10 | 0.041550 | 0.035890 | 0.033480 | 0.033150 | 0.031230 | 0.030650 | 0.030390 | 0.030180 | 0.030130|
| k13 | a11 | 0.060120 | 0.053420 | 0.048360 | 0.046320 | 0.045250 | 0.043780 | 0.043340 | 0.043740 | 0.043140|
| k13 | a12 | 0.082800 | 0.073010 | 0.068780 | 0.066030 | 0.063800 | 0.062530 | 0.062150 | 0.061670 | 0.061570|
| k13 | a13 | 0.109990 | 0.096620 | 0.089660 | 0.085540 | 0.083970 | 0.083110 | 0.081610 | 0.080550 | 0.080880|
| k13 | a14 | 0.149850 | 0.128680 | 0.117110 | 0.109680 | 0.105560 | 0.102870 | 0.102770 | 0.101860 | 0.101790|
| k13 | a15 | 0.192410 | 0.160480 | 0.148280 | 0.139020 | 0.132840 | 0.128250 | 0.125290 | 0.124110 | 0.123730|
| k15 | a1 | 0.000010 | 0.000030 | 0.000080 | 0.000110 | 0.000240 | 0.000230 | 0.000230 | 0.000210 | 0.000150|
| k15 | a2 | 1.285890 | 0.869770 | 0.590310 | 0.429650 | 0.321220 | 0.254330 | 0.207720 | 0.175090 | 0.150000|
| k15 | a3 | 2.901260 | 1.729940 | 1.069500 | 0.713960 | 0.490880 | 0.355920 | 0.270980 | 0.216940 | 0.175610|
| k15 | a4 | 3.784430 | 2.038960 | 1.159580 | 0.702470 | 0.447910 | 0.296490 | 0.208470 | 0.152260 | 0.114200|
| k15 | a5 | 4.052320 | 1.978880 | 1.015100 | 0.553140 | 0.302050 | 0.164370 | 0.093890 | 0.048500 | 0.019780|
| k15 | a6 | 3.928510 | 1.715160 | 0.753500 | 0.330360 | 0.121080 | 0.016510 | -0.041450 | -0.065400 | -0.082810|
| k15 | a7 | 3.638710 | 1.368680 | 0.468670 | 0.097780 | -0.052390 | -0.125070 | -0.155010 | -0.158630 | -0.160000|
| k15 | a8 | 3.250950 | 0.990100 | 0.196960 | -0.107210 | -0.207500 | -0.250590 | -0.244420 | -0.234840 | -0.222430|
| k15 | a9 | 2.881530 | 0.652210 | -0.064030 | -0.285910 | -0.355840 | -0.360780 | -0.336790 | -0.310690 | -0.281380|
| k15 | a10 | 2.485830 | 0.320240 | -0.297590 | -0.451780 | -0.479800 | -0.447430 | -0.412810 | -0.372580 | -0.334760|
| k15 | a11 | 2.124440 | 0.025770 | -0.502180 | -0.600630 | -0.584050 | -0.528160 | -0.475630 | -0.425110 | -0.380300|
| k15 | a12 | 1.832980 | -0.222980 | -0.668050 | -0.722780 | -0.674940 | -0.598000 | -0.530580 | -0.467380 | -0.410120|
| k15 | a13 | 1.557110 | -0.447000 | -0.817450 | -0.830980 | -0.753230 | -0.657920 | -0.578840 | -0.501900 | -0.439280|
| k15 | a14 | 1.306110 | -0.648530 | -0.950880 | -0.921060 | -0.814690 | -0.709840 | -0.616970 | -0.530250 | -0.461580|
| k15 | a15 | 1.112060 | -0.812650 | -1.062250 | -0.994630 | -0.865360 | -0.750250 | -0.646000 | -0.553940 | -0.479310|
| k17 | a1 | 0.000150 | 0.000140 | -0.000110 | -0.000570 | -0.000380 | -0.000370 | -0.000440 | -0.000530 | -0.000480|
| k17 | a2 | 1.512110 | 0.043590 | -0.250870 | -0.305500 | -0.289100 | -0.265680 | -0.238970 | -0.213000 | -0.189830|
| k17 | a3 | 0.354780 | -0.479610 | -0.566660 | -0.524200 | -0.461020 | -0.402590 | -0.355180 | -0.314360 | -0.277850|
| k17 | a4 | -0.209270 | -0.798840 | -0.791850 | -0.695950 | -0.593130 | -0.508760 | -0.442180 | -0.387150 | -0.338320|
| k17 | a5 | -0.627560 | -1.086420 | -0.994980 | -0.844790 | -0.711900 | -0.603050 | -0.516880 | -0.449950 | -0.391720|
| k17 | a6 | -0.975170 | -1.323500 | -1.155760 | -0.969450 | -0.808290 | -0.673630 | -0.578370 | -0.494410 | -0.428500|
| k17 | a7 | -1.283760 | -1.539740 | -1.312020 | -1.078450 | -0.886370 | -0.733570 | -0.624180 | -0.530700 | -0.455620|
| k17 | a8 | -1.535340 | -1.715490 | -1.434420 | -1.164500 | -0.951630 | -0.786110 | -0.663750 | -0.560430 | -0.479450|
| k17 | a9 | -1.728220 | -1.850700 | -1.523750 | -1.227920 | -0.996250 | -0.820180 | -0.689610 | -0.579590 | -0.495800|
| k17 | a10 | -1.878920 | -1.961250 | -1.597650 | -1.278730 | -1.033720 | -0.850110 | -0.711940 | -0.597750 | -0.509650|
| k17 | a11 | -1.986810 | -2.038950 | -1.648630 | -1.313300 | -1.058730 | -0.868020 | -0.725140 | -0.607200 | -0.518300|
| k17 | a12 | -2.062580 | -2.087300 | -1.680810 | -1.335940 | -1.075370 | -0.879730 | -0.733920 | -0.614360 | -0.523440|
| k17 | a13 | -2.106290 | -2.115610 | -1.698750 | -1.347330 | -1.082550 | -0.884450 | -0.737140 | -0.617460 | -0.526260|
| k17 | a14 | -2.128430 | -2.125540 | -1.703920 | -1.350370 | -1.083270 | -0.884450 | -0.737000 | -0.617750 | -0.526000|
| k17 | a15 | -2.125230 | -2.120690 | -1.697410 | -1.342970 | -1.076970 | -0.878730 | -0.731790 | -0.613590 | -0.522740|


### S. pneumoniae dataset


#### Kmer spectrum

In [33]:
#!snakemake -s pipeline/generate_stat.snakefile kmer_spectrum_simulated_reads
import pandas
import plotly.graph_objects as go

df = pandas.read_csv("stats/kmer_spectrum/s_pneumoniae.csv", index_col=0)

# One bar trace per column of the spectrum table, with fixed axis ranges
fig = go.Figure(data=[go.Bar(name=c, y=df[c]) for c in df.columns],
               layout=go.Layout(yaxis=dict(range=[0, 1_000_000]),
                                xaxis=dict(range=[0, 255])))

#fig.update_layout(yaxis_type="log")
fig.show()

#### Correction with reads kmer

In [16]:
display.Markdown(pconbr.identity.read_kmer("SRR8556426"))

| | | s1| s2| s3| s4| s5| s6| s7| s8| s9|
|:-|:-|-:|-:|-:|-:|-:|-:|-:|-:|-:|
| k13 | a1 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000|
| k13 | a2 | 0.002070 | 0.001890 | 0.001860 | 0.001810 | 0.001770 | 0.001750 | 0.001740 | 0.001740 | 0.001740|
| k13 | a3 | 0.009270 | 0.007980 | 0.007380 | 0.007140 | 0.007020 | 0.006850 | 0.006740 | 0.006710 | 0.006690|
| k13 | a4 | 0.022270 | 0.019900 | 0.018810 | 0.018170 | 0.017600 | 0.017660 | 0.017460 | 0.017350 | 0.017220|
| k13 | a5 | 0.046300 | 0.040430 | 0.037060 | 0.035450 | 0.033960 | 0.033080 | 0.032660 | 0.032880 | 0.032740|
| k13 | a6 | 0.076610 | 0.064340 | 0.058650 | 0.055470 | 0.052160 | 0.050700 | 0.049590 | 0.048940 | 0.049240|
| k13 | a7 | 0.113030 | 0.095510 | 0.085890 | 0.079370 | 0.075040 | 0.073520 | 0.072130 | 0.071150 | 0.070820|
| k13 | a8 | 0.157360 | 0.133590 | 0.117870 | 0.109680 | 0.103580 | 0.100180 | 0.098560 | 0.096860 | 0.095730|
| k13 | a9 | 0.199860 | 0.169870 | 0.151280 | 0.139390 | 0.129500 | 0.124400 | 0.121550 | 0.118480 | 0.116970|
| k13 | a10 | 0.259520 | 0.215860 | 0.187490 | 0.169780 | 0.158320 | 0.151270 | 0.146860 | 0.142510 | 0.140070|
| k13 | a11 | 0.315560 | 0.260000 | 0.225610 | 0.205450 | 0.191250 | 0.181830 | 0.175510 | 0.170580 | 0.167400|
| k13 | a12 | 0.375640 | 0.303700 | 0.263690 | 0.235870 | 0.215620 | 0.203110 | 0.196170 | 0.191510 | 0.187340|
| k13 | a13 | 0.440810 | 0.354220 | 0.300960 | 0.266100 | 0.245030 | 0.231560 | 0.223190 | 0.218080 | 0.213320|
| k13 | a14 | 0.506040 | 0.399030 | 0.338860 | 0.297830 | 0.273120 | 0.257570 | 0.247500 | 0.240210 | 0.234740|
| k13 | a15 | 0.569800 | 0.443910 | 0.372380 | 0.328610 | 0.299730 | 0.281520 | 0.269060 | 0.259670 | 0.252360|
| k15 | a1 | -0.000470 | -0.000390 | -0.000230 | -0.000180 | -0.000120 | -0.000100 | -0.000080 | -0.000050 | -0.000040|
| k15 | a2 | 1.449380 | 0.834280 | 0.520350 | 0.359010 | 0.264000 | 0.204650 | 0.167160 | 0.142340 | 0.123110|
| k15 | a3 | 2.413330 | 1.284030 | 0.765470 | 0.497990 | 0.349640 | 0.260290 | 0.204440 | 0.162180 | 0.133800|
| k15 | a4 | 2.987640 | 1.501120 | 0.853420 | 0.529670 | 0.358110 | 0.251560 | 0.186210 | 0.141800 | 0.110830|
| k15 | a5 | 3.313210 | 1.577480 | 0.850590 | 0.506360 | 0.324110 | 0.216690 | 0.150100 | 0.105140 | 0.075490|
| k15 | a6 | 3.490740 | 1.577920 | 0.819850 | 0.463470 | 0.279590 | 0.173200 | 0.104040 | 0.064900 | 0.036810|
| k15 | a7 | 3.547690 | 1.542040 | 0.761230 | 0.408480 | 0.226890 | 0.125470 | 0.063490 | 0.024180 | -0.003500|
| k15 | a8 | 3.545130 | 1.475180 | 0.698600 | 0.354320 | 0.178570 | 0.080220 | 0.024130 | -0.011520 | -0.034020|
| k15 | a9 | 3.501880 | 1.404740 | 0.633270 | 0.298200 | 0.129770 | 0.034900 | -0.015800 | -0.044910 | -0.064630|
| k15 | a10 | 3.414240 | 1.322940 | 0.565620 | 0.239380 | 0.076670 | -0.004330 | -0.047490 | -0.070650 | -0.089160|
| k15 | a11 | 3.329020 | 1.236130 | 0.493860 | 0.183240 | 0.034190 | -0.040220 | -0.080210 | -0.101420 | -0.116490|
| k15 | a12 | 3.230490 | 1.156750 | 0.433440 | 0.137360 | -0.001180 | -0.071110 | -0.106160 | -0.125800 | -0.138510|
| k15 | a13 | 3.117170 | 1.074210 | 0.375290 | 0.088840 | -0.039950 | -0.102830 | -0.134970 | -0.150910 | -0.160240|
| k15 | a14 | 3.001770 | 0.992860 | 0.315750 | 0.044940 | -0.076820 | -0.133640 | -0.162300 | -0.173670 | -0.179900|
| k15 | a15 | 2.894750 | 0.925340 | 0.263780 | 0.007950 | -0.108210 | -0.158590 | -0.183230 | -0.193840 | -0.197400|
| k17 | a1 | -0.001160 | -0.000680 | -0.000400 | -0.000290 | -0.000250 | -0.000240 | -0.000220 | -0.000220 | -0.000220|
| k17 | a2 | 1.758330 | 0.375990 | 0.026820 | -0.085900 | -0.126430 | -0.140500 | -0.144070 | -0.142170 | -0.137820|
| k17 | a3 | 1.038080 | 0.096470 | -0.137790 | -0.207670 | -0.226220 | -0.226200 | -0.220180 | -0.211990 | -0.203590|
| k17 | a4 | 0.652130 | -0.063690 | -0.234670 | -0.278990 | -0.285080 | -0.278730 | -0.266370 | -0.252860 | -0.240710|
| k17 | a5 | 0.414220 | -0.171800 | -0.293540 | -0.319960 | -0.318790 | -0.305470 | -0.291770 | -0.276860 | -0.262390|
| k17 | a6 | 0.237380 | -0.252590 | -0.340480 | -0.353300 | -0.344810 | -0.327990 | -0.310770 | -0.293930 | -0.277800|
| k17 | a7 | 0.099840 | -0.310490 | -0.373170 | -0.377980 | -0.363750 | -0.343970 | -0.325320 | -0.307290 | -0.290540|
| k17 | a8 | -0.012880 | -0.362820 | -0.405900 | -0.403760 | -0.384220 | -0.362150 | -0.341670 | -0.322220 | -0.304780|
| k17 | a9 | -0.106140 | -0.409070 | -0.436570 | -0.426320 | -0.402220 | -0.377790 | -0.356050 | -0.334170 | -0.315230|
| k17 | a10 | -0.190270 | -0.447950 | -0.460090 | -0.444850 | -0.419440 | -0.394080 | -0.368660 | -0.345320 | -0.326410|
| k17 | a11 | -0.269260 | -0.485810 | -0.484460 | -0.463180 | -0.435120 | -0.407350 | -0.381390 | -0.358100 | -0.337330|
| k17 | a12 | -0.341380 | -0.521760 | -0.508740 | -0.482100 | -0.452330 | -0.422440 | -0.395970 | -0.371490 | -0.349300|
| k17 | a13 | -0.411050 | -0.556530 | -0.530050 | -0.501450 | -0.468640 | -0.437470 | -0.409020 | -0.383390 | -0.360550|
| k17 | a14 | -0.477310 | -0.588360 | -0.551150 | -0.517370 | -0.483500 | -0.449420 | -0.420320 | -0.394190 | -0.370290|
| k17 | a15 | -0.534200 | -0.618490 | -0.569040 | -0.532760 | -0.496270 | -0.462130 | -0.432030 | -0.405250 | -0.380700|


### S. cerevisiae dataset

## Long read correction

To evaluate our correction against other tools, we use:
- [ELECTOR](//doi.org/10.1101/512889) to compare the corrected reads against the reference genome
- [QUAST](//doi.org/10.1093/bioinformatics/bty266) to evaluate the assembly results (redbean, rala, flye)

### Self correction

We compare pconbr against other self-correction tools.

| Tool name  | Reference                                                                |
|:-----------|:-------------------------------------------------------------------------|
| CONSENT    | [10.1101/546630](//doi.org/10.1101/546630)                               |
| daccord    | [10.1101/106252](//doi.org/10.1101/106252)                               |
| FLAS       | [10.1093/bioinformatics/btz206](//doi.org/10.1093/bioinformatics/btz206) |
| MECAT      | [10.1038/nmeth.4432](//doi.org/10.1038/nmeth.4432)                       |

#### Mapping result

In [17]:
display.Markdown("TODO")

TODO

#### Assembly result

In [18]:
display.Markdown("TODO")

TODO

### Hybrid correction

We compare pconbr against other hybrid correction tools.

| Tool name  | Reference                                                                |
|:-----------|:-------------------------------------------------------------------------|


#### Mapping result

In [19]:
display.Markdown("TODO")

TODO

#### Assembly result

In [20]:
display.Markdown("TODO")

TODO

## Polishing

