In [1]:
from IPython import display
import pconbr
import pconbr.bench
import pconbr.bench.count
import pconbr.identity
import pconbr.kmer_count
import pconbr.kmer_count.curve

# Code name: pconbr

Project target: Use a kmer counter to perform a pre-correction step on long-read data

## Dataset

### References

| code name       | species         | path                          | genome size |
|:----------------|:----------------|:------------------------------|------------:|
| s_pneumoniae    | S. pneumoniae   | reference/CP026549.fasta      |      2.2 Mb |
| c_vartiovaarae  | C. vartiovaarae |                               |    ~11.2 Mb |
| e_coli_ont      | E. coli         | reference/CP028309.fasta      |      4.7 Mb |
| e_coli_pb       | E. coli         | reference/CP028309.fasta      |      4.7 Mb |
| s_cerevisiae    | S. cerevisiae   | reference/GCA_002163515.fasta |     12.4 Mb |


### Reads
| code name       | species         | path                        | # bases (Gb)| coverage |
|:----------------|:----------------|:----------------------------|------------:|---------:|
| s_pneumoniae    | S. pneumoniae   | reads/SRR8556426.fasta      |         2.2 |   ~1000x |
| c_vartiovaarae  | C. vartiovaarae | reads/ERR18779[66-70].fasta |         1.7 |    ~150x |
| e_coli_ont      | E. coli         | reads/SRR8494940.fasta      |         1.6 |    ~340x |
| e_coli_pb       | E. coli         | reads/SRR8494911.fasta      |         1.4 |    ~297x |
| s_cerevisiae    | S. cerevisiae   | reads/SRR2157264_[1-2]      |       0.187 |     ~15x |



In [2]:
# To download the reference genomes, uncomment the next line and run this cell (this can take a long time)
#!./script/dl_ref.sh

In [3]:
# To download the reads, uncomment the next line and run this cell (this can take a long time)
#!./script/dl_reads.sh

## Kmer counting

In [4]:
# To run the pcon, KMC and Jellyfish counts on the datasets, uncomment the next line and run this cell
#!snakemake -s pipeline/count.snakefile all

Each file `benchmark/{counter name}/{dataset codename}.tsv` contains the time (in seconds) and memory (in Mb) usage of one run. This information is summarized in the tables below.

In [5]:
display.Markdown(pconbr.bench.count.get("time"))

| dataset | k | Jellyfish | Kmc | Pconbr |
|:-|:-|-:|-:|-:|
| c_vartiovaarae | k13 | 413.5882 | 193.6503 | 53.1597 |
| c_vartiovaarae | k15 | 766.9468 | 784.6563 | 63.4619 |
| c_vartiovaarae | k17 | 1288.2844 | 833.9791 | 252.0244 |
| e_coli_ont | k13 | 403.0423 | 184.5209 | 49.9495 |
| e_coli_ont | k15 | 1135.5411 | 724.3784 | 61.9082 |
| e_coli_ont | k17 | 1166.9966 | 780.5869 | 214.6334 |
| e_coli_pb | k13 | 361.0433 | 157.6295 | 44.4903 |
| e_coli_pb | k15 | 1456.8800 | 732.9156 | 54.2674 |
| e_coli_pb | k17 | 1255.0258 | 766.6578 | 215.9037 |
| s_cerevisiae | k13 | 56.8721 | 24.2400 | 6.6724 |
| s_cerevisiae | k15 | 123.3701 | 130.9554 | 9.8227 |
| s_cerevisiae | k17 | 290.8972 | 133.2776 | 63.6876 |
| s_pneumoniae | k13 | 540.4675 | 265.6703 | 66.9792 |
| s_pneumoniae | k15 | 905.0415 | 870.6088 | 84.3198 |
| s_pneumoniae | k17 | 1301.9399 | 939.4879 | 284.0073 |


In [6]:
display.Markdown(pconbr.bench.count.get("memory"))

| dataset | k | Jellyfish | Kmc | Pconbr |
|:-|:-|-:|-:|-:|
| c_vartiovaarae | k13 | 1581.97 | 2143.56 | 22.00 |
| c_vartiovaarae | k15 | 6204.13 | 10830.66 | 262.54 |
| c_vartiovaarae | k17 | 16391.34 | 11046.30 | 4101.01 |
| e_coli_ont | k13 | 1387.03 | 2219.09 | 22.03 |
| e_coli_ont | k15 | 22121.59 | 10636.99 | 262.75 |
| e_coli_ont | k17 | 16391.30 | 10898.98 | 4103.75 |
| e_coli_pb | k13 | 1993.30 | 2034.21 | 21.91 |
| e_coli_pb | k15 | 35264.41 | 10717.47 | 262.36 |
| e_coli_pb | k17 | 16391.14 | 11079.68 | 4103.34 |
| s_cerevisiae | k13 | 257.59 | 657.64 | 22.02 |
| s_cerevisiae | k15 | 1957.95 | 1442.41 | 262.57 |
| s_cerevisiae | k17 | 16391.10 | 1367.47 | 4092.84 |
| s_pneumoniae | k13 | 1783.90 | 2810.83 | 21.98 |
| s_pneumoniae | k15 | 11002.25 | 11242.54 | 262.57 |
| s_pneumoniae | k17 | 16391.33 | 11273.15 | 4103.57 |
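
The benchmark files described at the start of this section can be gathered into rows with a few lines of Python. This is a minimal sketch, not the code behind `pconbr.bench.count.get`: the function name `load_benchmarks` is hypothetical, and the column names `s` and `max_rss` assume that the files follow Snakemake's benchmark format.

```python
import csv
from pathlib import Path


def load_benchmarks(root, time_col="s", mem_col="max_rss"):
    """Collect one row per run from root/{counter name}/{dataset codename}.tsv.

    Assumption: each TSV has a header line containing `time_col` and `mem_col`
    (the defaults match Snakemake benchmark files).
    """
    rows = []
    for path in sorted(Path(root).glob("*/*.tsv")):
        with path.open(newline="") as handle:
            for record in csv.DictReader(handle, delimiter="\t"):
                rows.append({
                    "counter": path.parent.name,   # directory name
                    "dataset": path.stem,          # file name without .tsv
                    "time_s": float(record[time_col]),
                    "memory_mb": float(record[mem_col]),
                })
    return rows
```

Calling `load_benchmarks("benchmark")` would return a list of dictionaries that can be passed directly to `pandas.DataFrame` for pivoting into tables like the ones above.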


## PconBr parameter exploration

### Simulated dataset

The error rate was evaluated from the `error rate:` line of `samtools stats`.
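
As an illustration of how that value can be read, the sketch below extracts the `error rate:` field from the text produced by `samtools stats`. The function name `parse_error_rate` is hypothetical and is not part of pconbr; the sketch relies on the tab-separated summary lines that `samtools stats` prefixes with `SN`.

```python
def parse_error_rate(stats_text):
    """Return the error rate reported by `samtools stats`, or None if absent.

    Summary lines look like: SN<TAB>error rate:<TAB>value<TAB># comment
    """
    for line in stats_text.splitlines():
        if line.startswith("SN\terror rate:"):
            # The third tab-separated field holds the numeric value.
            return float(line.split("\t")[2])
    return None
```

The function takes the full text output, so it can be fed the contents of a saved stats file or the captured standard output of the command.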

Reads were simulated with [Badread](https://github.com/rrwick/Badread) on the E. coli CFT073 genome ([ENA id CP028309](https://www.ebi.ac.uk/ena/data/view/CP028309)), with an error rate of 5.625682.

We evaluate identity for the pconbr pipeline with different values of the kmer size (k), the number of kmers required to validate a kmer (s), and the minimal abundance of a solid kmer (a).
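
To make the roles of k and a concrete, here is a minimal sketch of kmer counting followed by solid kmer selection. It is not pconbr's implementation: the helper names are hypothetical, counting canonical kmers (the lexicographic minimum of a kmer and its reverse complement) is an assumption, and "solid" is taken to mean an abundance of at least a. The s parameter is not modeled here.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Return the smaller of a kmer and its reverse complement (assumed convention)."""
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def count_kmers(reads, k):
    """Count every canonical kmer of size k across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[canonical(read[i:i + k])] += 1
    return counts


def solid_kmers(counts, a):
    """Keep kmers whose abundance is at least a (assumed meaning of 'solid')."""
    return {kmer for kmer, n in counts.items() if n >= a}
```

Under this reading, a1 marks every observed kmer as solid, so no kmer is left to correct.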


In [7]:
# Run the snakemake pipelines that test the parameters on each dataset
#!snakemake -s pipeline/parameter_exploration.snakefile genomic_kmer
#!snakemake -s pipeline/parameter_exploration.snakefile read_kmer
#!snakemake -s pipeline/parameter_exploration.snakefile bacteria
#!snakemake -s pipeline/parameter_exploration.snakefile yeast

### Effect of k and s on synthetic dataset

#### With genomic kmer

Difference between the original error rate and the corrected read error rate.

In [18]:
display.Markdown(pconbr.identity.genomic_kmer())

| | s1| s2| s3| s4| s5| s6| s7| s8| s9|
|:-|-:|-:|-:|-:|-:|-:|-:|-:|-:|
| k9 | 0.004243 | 0.003956 | 0.003789 | 0.003733 | 0.003733 | 0.003733 | 0.003730 | 0.003727 | 0.003724|
| k11 | 0.758194 | 0.610341 | 0.515930 | 0.454310 | 0.413753 | 0.385005 | 0.363640 | 0.346665 | 0.331660|
| k13 | 9.158648 | 2.776566 | 0.661774 | -0.118602 | -0.436913 | -0.569549 | -0.619379 | -0.631191 | -0.624060|
| k15 | -0.593299 | -1.675826 | -1.759112 | -1.682531 | -1.574692 | -1.469168 | -1.372143 | -1.283744 | -1.203183|
| k17 | -2.280677 | -2.119799 | -1.940914 | -1.784599 | -1.645088 | -1.523983 | -1.417599 | -1.322485 | -1.236514|


#### Kmer spectrum of noisy reads

In [31]:
#!snakemake -s pipeline/generate_stat.snakefile kmer_spectrum_simulated_reads
import pandas
import plotly.graph_objects as go

# One column per kmer size, one row per abundance value
df = pandas.read_csv("stats/simulated_reads.kmer_spectrum.csv", index_col=0)
print(df)

# One bar trace per kmer size
fig = go.Figure(data=[go.Bar(name=c, y=df[c]) for c in df.columns])
fig.show()

         k9     k11
0         0      66
1         0     314
2         0     827
3         0    1653
4         0    2682
..      ...     ...
251       9    3009
252      12    3088
253      11    3213
254       9    3146
255  129628  643673

[256 rows x 2 columns]
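
The 256 rows and the large final bin are consistent with abundances being capped at 255, although the notebook does not state this. Under that assumption, a spectrum like the one above could be built from a table of kmer counts as follows; `kmer_spectrum` is a hypothetical helper, not a pconbr function.

```python
def kmer_spectrum(counts, max_count=255):
    """Return a list where index i is the number of kmers seen i times.

    Abundances above max_count are folded into the last bin (assumed
    saturation). Kmers absent from `counts` are not added to bin 0.
    """
    spectrum = [0] * (max_count + 1)
    for n in counts.values():
        spectrum[min(n, max_count)] += 1
    return spectrum
```

The input is any mapping from kmer to abundance, such as a `collections.Counter`.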


#### Correction with noisy read kmers

In [10]:
display.Markdown(pconbr.identity.read_kmer("simulated_reads"))

| | | s1| s2| s3| s4| s5| s6| s7| s8| s9|
|:-|:-|-:|-:|-:|-:|-:|-:|-:|-:|-:|
| k13 | a1 | 0.000029 | 0.000026 | 0.000026 | 0.000023 | 0.000023 | 0.000022 | 0.000020 | 0.000017 | 0.000014|
| k13 | a2 | 0.183153 | 0.144473 | 0.120515 | 0.105219 | 0.095360 | 0.088957 | 0.084602 | 0.081637 | 0.079537|
| k13 | a3 | 0.565981 | 0.422615 | 0.337803 | 0.285259 | 0.251894 | 0.229637 | 0.214533 | 0.203712 | 0.195564|
| k13 | a4 | 1.037906 | 0.737499 | 0.565391 | 0.460931 | 0.395181 | 0.351924 | 0.322046 | 0.300782 | 0.284209|
| k13 | a5 | 1.529697 | 1.031342 | 0.756427 | 0.594780 | 0.494429 | 0.429087 | 0.383908 | 0.351254 | 0.325804|
| k13 | a6 | 2.006685 | 1.283683 | 0.899364 | 0.680178 | 0.546947 | 0.460584 | 0.401566 | 0.358444 | 0.324519|
| k13 | a7 | 2.466110 | 1.490862 | 0.993164 | 0.719183 | 0.555940 | 0.451637 | 0.381025 | 0.329409 | 0.289094|
| k13 | a8 | 2.921537 | 1.660569 | 1.049283 | 0.723642 | 0.534254 | 0.414936 | 0.335592 | 0.278149 | 0.233187|
| k13 | a9 | 3.356516 | 1.796660 | 1.073578 | 0.701064 | 0.489425 | 0.359049 | 0.273636 | 0.212436 | 0.165375|
| k15 | a1 | 0.001020 | 0.001010 | 0.000757 | 0.000529 | 0.000361 | 0.000257 | 0.000175 | 0.000125 | 0.000087|
| k15 | a2 | 4.808098 | 1.496592 | 0.480983 | 0.117895 | -0.031171 | -0.097850 | -0.128863 | -0.143265 | -0.149054|
| k15 | a3 | 3.067901 | 0.583266 | -0.126570 | -0.354596 | -0.429535 | -0.448909 | -0.446697 | -0.435142 | -0.420358|
| k15 | a4 | 1.741794 | -0.170984 | -0.642049 | -0.762086 | -0.776632 | -0.757068 | -0.726243 | -0.692545 | -0.659247|
| k15 | a5 | 0.899558 | -0.681172 | -1.006135 | -1.056697 | -1.030743 | -0.984044 | -0.932765 | -0.882585 | -0.835033|
| k15 | a6 | 0.373487 | -1.014482 | -1.250142 | -1.256841 | -1.204346 | -1.138853 | -1.073398 | -1.011505 | -0.953997|
| k15 | a7 | 0.046119 | -1.229433 | -1.410892 | -1.389698 | -1.319745 | -1.242278 | -1.167164 | -1.097303 | -1.032796|
| k15 | a8 | -0.160191 | -1.368173 | -1.515180 | -1.476307 | -1.395058 | -1.309397 | -1.227911 | -1.152800 | -1.083657|
| k15 | a9 | -0.293691 | -1.459802 | -1.585154 | -1.534622 | -1.445692 | -1.354574 | -1.268870 | -1.190011 | -1.117721|
| k17 | a1 | 0.000519 | 0.000192 | -0.000056 | -0.000148 | -0.000163 | -0.000145 | -0.000132 | -0.000126 | -0.000116|
| k17 | a2 | 0.052245 | -0.463331 | -0.543232 | -0.538630 | -0.514668 | -0.487277 | -0.460951 | -0.436353 | -0.413917|
| k17 | a3 | -0.824204 | -1.060827 | -1.046410 | -0.992918 | -0.933189 | -0.876647 | -0.824624 | -0.776994 | -0.733466|
| k17 | a4 | -1.406208 | -1.472354 | -1.393893 | -1.303980 | -1.216514 | -1.136892 | -1.065155 | -1.000227 | -0.940937|
| k17 | a5 | -1.754412 | -1.720155 | -1.601790 | -1.488081 | -1.382459 | -1.288192 | -1.203961 | -1.128035 | -1.058969|
| k17 | a6 | -1.952835 | -1.863846 | -1.721692 | -1.593184 | -1.476421 | -1.373143 | -1.281502 | -1.199048 | -1.124185|
| k17 | a7 | -2.069031 | -1.949423 | -1.793177 | -1.655272 | -1.531456 | -1.422734 | -1.326426 | -1.239928 | -1.161537|
| k17 | a8 | -2.138176 | -2.001680 | -1.836938 | -1.693116 | -1.564693 | -1.452537 | -1.353352 | -1.264395 | -1.183822|
| k17 | a9 | -2.181178 | -2.035071 | -1.865341 | -1.717630 | -1.586192 | -1.471614 | -1.370560 | -1.279898 | -1.197926|


#### On real bacteria dataset

In [11]:
display.Markdown(pconbr.identity.read_kmer("SRR8494911"))

| | | s1| s2| s3| s4| s5| s6| s7| s8| s9|
|:-|:-|-:|-:|-:|-:|-:|-:|-:|-:|-:|
| k13 | a1 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000|
| k13 | a2 | 0.000010 | 0.000010 | 0.000010 | 0.000010 | 0.000010 | 0.000010 | 0.000010 | 0.000010 | 0.000010|
| k13 | a3 | 0.000070 | 0.000070 | 0.000070 | 0.000070 | 0.000070 | 0.000070 | 0.000060 | 0.000060 | 0.000060|
| k13 | a4 | 0.000170 | 0.000140 | 0.000120 | 0.000110 | 0.000190 | 0.000180 | 0.000180 | 0.000180 | 0.000180|
| k13 | a5 | 0.001040 | 0.000960 | 0.000950 | 0.000840 | 0.000860 | 0.000850 | 0.000840 | 0.000840 | 0.000760|
| k13 | a6 | 0.002410 | 0.002150 | 0.002480 | 0.002310 | 0.002290 | 0.002140 | 0.002140 | 0.002140 | 0.002110|
| k13 | a7 | 0.007530 | 0.007510 | 0.007350 | 0.007250 | 0.007160 | 0.007130 | 0.007010 | 0.006960 | 0.006980|
| k13 | a8 | 0.012980 | 0.012880 | 0.012180 | 0.012550 | 0.011770 | 0.011620 | 0.011770 | 0.011840 | 0.011690|
| k13 | a9 | 0.024840 | 0.023170 | 0.021780 | 0.020860 | 0.020330 | 0.020080 | 0.019940 | 0.019740 | 0.019750|
| k15 | a1 | 0.000010 | 0.000030 | 0.000080 | 0.000110 | 0.000240 | 0.000230 | 0.000230 | 0.000210 | 0.000150|
| k15 | a2 | 1.285890 | 0.869770 | 0.590310 | 0.429650 | 0.321220 | 0.254330 | 0.207720 | 0.175090 | 0.150000|
| k15 | a3 | 2.901260 | 1.729940 | 1.069500 | 0.713960 | 0.490880 | 0.355920 | 0.270980 | 0.216940 | 0.175610|
| k15 | a4 | 3.784430 | 2.038960 | 1.159580 | 0.702470 | 0.447910 | 0.296490 | 0.208470 | 0.152260 | 0.114200|
| k15 | a5 | 4.052320 | 1.978880 | 1.015100 | 0.553140 | 0.302050 | 0.164370 | 0.093890 | 0.048500 | 0.019780|
| k15 | a6 | 3.928510 | 1.715160 | 0.753500 | 0.330360 | 0.121080 | 0.016510 | -0.041450 | -0.065400 | -0.082810|
| k15 | a7 | 3.638710 | 1.368680 | 0.468670 | 0.097780 | -0.052390 | -0.125070 | -0.155010 | -0.158630 | -0.160000|
| k15 | a8 | 3.250950 | 0.990100 | 0.196960 | -0.107210 | -0.207500 | -0.250590 | -0.244420 | -0.234840 | -0.222430|
| k15 | a9 | 2.881530 | 0.652210 | -0.064030 | -0.285910 | -0.355840 | -0.360780 | -0.336790 | -0.310690 | -0.281380|
| k17 | a1 | 0.000150 | 0.000140 | -0.000110 | -0.000570 | -0.000380 | -0.000370 | -0.000440 | -0.000530 | -0.000480|
| k17 | a2 | 1.512110 | 0.043590 | -0.250870 | -0.305500 | -0.289100 | -0.265680 | -0.238970 | -0.213000 | -0.189830|
| k17 | a3 | 0.354780 | -0.479610 | -0.566660 | -0.524200 | -0.461020 | -0.402590 | -0.355180 | -0.314360 | -0.277850|
| k17 | a4 | -0.209270 | -0.798840 | -0.791850 | -0.695950 | -0.593130 | -0.508760 | -0.442180 | -0.387150 | -0.338320|
| k17 | a5 | -0.627560 | -1.086420 | -0.994980 | -0.844790 | -0.711900 | -0.603050 | -0.516880 | -0.449950 | -0.391720|
| k17 | a6 | -0.975170 | -1.323500 | -1.155760 | -0.969450 | -0.808290 | -0.673630 | -0.578370 | -0.494410 | -0.428500|
| k17 | a7 | -1.283760 | -1.539740 | -1.312020 | -1.078450 | -0.886370 | -0.733570 | -0.624180 | -0.530700 | -0.455620|
| k17 | a8 | -1.535340 | -1.715490 | -1.434420 | -1.164500 | -0.951630 | -0.786110 | -0.663750 | -0.560430 | -0.479450|
| k17 | a9 | -1.728220 | -1.850700 | -1.523750 | -1.227920 | -0.996250 | -0.820180 | -0.689610 | -0.579590 | -0.495800|


In [12]:
display.Markdown(pconbr.identity.read_kmer("SRR8494940"))

| | | s1| s2| s3| s4| s5| s6| s7| s8| s9|
|:-|:-|-:|-:|-:|-:|-:|-:|-:|-:|-:|
| k13 | a1 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000|
| k13 | a2 | 0.002080 | 0.001930 | 0.001790 | 0.002060 | 0.002190 | 0.002170 | 0.002170 | 0.002170 | 0.002170|
| k13 | a3 | 0.007390 | 0.005960 | 0.006370 | 0.006500 | 0.006010 | 0.005770 | 0.005520 | 0.005580 | 0.005570|
| k13 | a4 | 0.022760 | 0.021010 | 0.016540 | 0.015990 | 0.016240 | 0.016080 | 0.016160 | 0.016100 | 0.016070|
| k13 | a5 | 0.040810 | 0.033690 | 0.031390 | 0.029940 | 0.028690 | 0.028010 | 0.027390 | 0.027480 | 0.026570|
| k13 | a6 | 0.064580 | 0.053350 | 0.049360 | 0.047040 | 0.046350 | 0.045950 | 0.043560 | 0.043670 | 0.044350|
| k13 | a7 | 0.093220 | 0.078590 | 0.069830 | 0.068580 | 0.067590 | 0.065270 | 0.065900 | 0.064790 | 0.064890|
| k13 | a8 | 0.132410 | 0.109700 | 0.096130 | 0.090450 | 0.086220 | 0.082460 | 0.080520 | 0.079270 | 0.078740|
| k13 | a9 | 0.169740 | 0.142450 | 0.124890 | 0.117010 | 0.111130 | 0.105900 | 0.101940 | 0.100750 | 0.100090|
| k15 | a1 | -0.000120 | -0.000090 | -0.000100 | -0.000100 | -0.000110 | -0.000110 | -0.000110 | -0.000120 | 0.000010|
| k15 | a2 | 1.423390 | 0.888880 | 0.577720 | 0.406670 | 0.313520 | 0.255960 | 0.215180 | 0.184910 | 0.161200|
| k15 | a3 | 2.421420 | 1.448110 | 0.908490 | 0.617190 | 0.455280 | 0.346270 | 0.283010 | 0.238800 | 0.203510|
| k15 | a4 | 3.046960 | 1.683210 | 1.006540 | 0.654910 | 0.451250 | 0.331200 | 0.258740 | 0.210900 | 0.168250|
| k15 | a5 | 3.330690 | 1.775760 | 1.004910 | 0.610000 | 0.402730 | 0.275060 | 0.192900 | 0.147750 | 0.111000|
| k15 | a6 | 3.484930 | 1.753000 | 0.913280 | 0.528390 | 0.322170 | 0.207120 | 0.127480 | 0.082390 | 0.057170|
| k15 | a7 | 3.485880 | 1.648830 | 0.823040 | 0.442340 | 0.249870 | 0.136740 | 0.068280 | 0.028370 | 0.000490|
| k15 | a8 | 3.452590 | 1.542000 | 0.724210 | 0.352350 | 0.170830 | 0.062920 | 0.013430 | -0.020800 | -0.043570|
| k15 | a9 | 3.363460 | 1.421180 | 0.606830 | 0.245390 | 0.088810 | -0.003640 | -0.047870 | -0.068580 | -0.086450|
| k17 | a1 | 0.000050 | 0.000000 | -0.000110 | -0.000120 | -0.000140 | -0.000160 | -0.000170 | -0.000240 | -0.000240|
| k17 | a2 | 1.956870 | 0.465040 | 0.048770 | -0.074490 | -0.114260 | -0.122720 | -0.121880 | -0.121830 | -0.118460|
| k17 | a3 | 1.018780 | 0.086340 | -0.143390 | -0.199680 | -0.211110 | -0.203190 | -0.191350 | -0.178470 | -0.168480|
| k17 | a4 | 0.554690 | -0.121230 | -0.252290 | -0.286540 | -0.271460 | -0.258180 | -0.242200 | -0.227570 | -0.213100|
| k17 | a5 | 0.216180 | -0.263960 | -0.342110 | -0.348950 | -0.327410 | -0.305160 | -0.285740 | -0.264320 | -0.248920|
| k17 | a6 | -0.036350 | -0.382830 | -0.413190 | -0.402350 | -0.370430 | -0.345760 | -0.319240 | -0.297690 | -0.276720|
| k17 | a7 | -0.245580 | -0.474910 | -0.470180 | -0.452330 | -0.416950 | -0.381590 | -0.355370 | -0.330780 | -0.307450|
| k17 | a8 | -0.426010 | -0.555840 | -0.515770 | -0.490020 | -0.451070 | -0.410990 | -0.381700 | -0.353930 | -0.328180|
| k17 | a9 | -0.564010 | -0.619060 | -0.559130 | -0.521820 | -0.477220 | -0.435800 | -0.399160 | -0.370730 | -0.346600|


#### On real yeast dataset

## Long read correction

To evaluate our correction against other tools, we use:
- [ELECTOR](//doi.org/10.1101/512889) to compare the corrected reads against the reference genome
- [QUAST](//doi.org/10.1093/bioinformatics/bty266) to compare the assembly results (redbean, rala, flye)

### Self correction

We compare pconbr against other self correction tools.

| Tools name | Reference                                                                |
|:-----------|:-------------------------------------------------------------------------|
| CONSENT    | [10.1101/546630](//doi.org/10.1101/546630)                               |
| daccord    | [10.1101/106252](//doi.org/10.1101/106252)                               |
| FLAS       | [10.1093/bioinformatics/btz206](//doi.org/10.1093/bioinformatics/btz206) |
| MECAT      | [10.1038/nmeth.4432](//doi.org/10.1038/nmeth.4432)                       |

#### Mapping result

In [13]:
display.Markdown("TODO")

TODO

#### Assembly result

In [14]:
display.Markdown("TODO")

TODO

### Hybrid correction

We compare pconbr against other hybrid correction tools.

| Tools name | Reference                                                                |
|:-----------|:-------------------------------------------------------------------------|


#### Mapping result

In [15]:
display.Markdown("TODO")

TODO

#### Assembly result

In [16]:
display.Markdown("TODO")

TODO

## Polishing

