In [2]:
import pconbr
import pconbr.bench
import pconbr.bench.count
import pconbr.identity
import pconbr.kmer_count
import pconbr.kmer_count.curve

# Code name: pconbr

Project target: Use a kmer counter to perform a pre-correction step on long-read data

## Dataset

### References

| code name       | species         | path                           | genome size |
|:----------------|:----------------|:-------------------------------|------------:|
| s_pneumoniae    | S. pneumoniae   | references/CP026549.fasta      |      2.2 Mb |
| c_vartiovaarae  | C. vartiovaarae |                                |    ~11.2 Mb |
| e_coli_ont      | E. coli         | references/CP028309.fasta      |      4.7 Mb |
| e_coli_pb       | E. coli         | references/CP028309.fasta      |      4.7 Mb |
| s_cerevisiae    | S. cerevisiae   | references/GCA_002163515.fasta |     12.4 Mb |


### Reads
| code name       | species         | path                        | # bases (Gb)| coverage |
|:----------------|:----------------|:----------------------------|------------:|---------:|
| s_pneumoniae    | S. pneumoniae   | reads/SRR8556426.fasta      |         2.2 |   ~1000x |
| c_vartiovaarae  | C. vartiovaarae | reads/ERR18779[66-70].fasta |         1.7 |    ~150x |
| e_coli_ont      | E. coli         | reads/SRR8494940.fasta      |         1.6 |    ~340x |
| e_coli_pb       | E. coli         | reads/SRR8494911.fasta      |         1.4 |    ~297x |
| s_cerevisiae    | S. cerevisiae   | reads/SRR2157264_[1-2]      |       0.187 |     ~15x |



In [2]:
# To download the reference genomes, uncomment the next line and run this cell (this can take a long time)
#!./script/dl_ref.sh

In [3]:
# To download the reads, uncomment the next line and run this cell (this can take a long time)
#!./script/dl_reads.sh

## Kmer counting

In [4]:
# To run the pcon, kmc and jellyfish counts on the datasets, uncomment the next line and run this cell
#!snakemake -s pipeline/count.snakefile count_all

The files `benchmark/{counter name}/{dataset codename}.tsv` contain the time (in seconds) and memory (in MB) usage of each run; this information is summarized in the tables below.
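As an illustration of how such per-run files can be collapsed into one table, here is a minimal sketch. It is not the code behind `pconbr.bench.count.get`; it assumes the files follow Snakemake's benchmark layout (tab-separated, with a header containing columns such as `s` for wall-clock seconds and `max_rss` for peak memory), and the function name `summarize` is hypothetical.

```python
import csv
from pathlib import Path


def summarize(root, column="s"):
    """Return {dataset: {counter: mean value}} from root/<counter>/<dataset>.tsv.

    Assumption: each file is a tab-separated table with a header row and
    one data row per repetition (Snakemake benchmark layout).
    """
    table = {}
    for path in sorted(Path(root).glob("*/*.tsv")):
        counter, dataset = path.parent.name, path.stem
        with path.open(newline="") as handle:
            rows = list(csv.DictReader(handle, delimiter="\t"))
        if not rows:
            continue  # skip empty benchmark files
        # Average over repetitions when a run was benchmarked several times.
        values = [float(row[column]) for row in rows]
        table.setdefault(dataset, {})[counter] = sum(values) / len(values)
    return table
```

Calling `summarize("benchmark", "s")` would give the timing view and `summarize("benchmark", "max_rss")` the memory view, provided the directory follows the layout described above.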

In [5]:
pconbr.bench.count.get("time")

| dataset        | k   | jellyfish |      kmc |     pcon |
|:---------------|:----|----------:|---------:|---------:|
| c_vartiovaarae | k13 |  413.5882 | 193.6503 |  53.1597 |
| c_vartiovaarae | k15 |  766.9468 | 784.6563 |  63.4619 |
| c_vartiovaarae | k17 | 1288.2844 | 833.9791 | 252.0244 |
| e_coli_ont     | k13 |  403.0423 | 184.5209 |  49.9495 |
| e_coli_ont     | k15 | 1135.5411 | 724.3784 |  61.9082 |
| e_coli_ont     | k17 | 1166.9966 | 780.5869 | 214.6334 |
| e_coli_pb      | k13 |  361.0433 | 157.6295 |  44.4903 |
| e_coli_pb      | k15 |   1456.88 | 732.9156 |  54.2674 |
| e_coli_pb      | k17 | 1255.0258 | 766.6578 | 215.9037 |
| s_cerevisiae   | k13 |   56.8721 |    24.24 |   6.6724 |


In [6]:
pconbr.bench.count.get("memory")

| dataset        | k   | jellyfish |      kmc |    pcon |
|:---------------|:----|----------:|---------:|--------:|
| c_vartiovaarae | k13 |   1581.97 |  2143.56 |    22.0 |
| c_vartiovaarae | k15 |   6204.13 | 10830.66 |  262.54 |
| c_vartiovaarae | k17 |  16391.34 |  11046.3 | 4101.01 |
| e_coli_ont     | k13 |   1387.03 |  2219.09 |   22.03 |
| e_coli_ont     | k15 |  22121.59 | 10636.99 |  262.75 |
| e_coli_ont     | k17 |   16391.3 | 10898.98 | 4103.75 |
| e_coli_pb      | k13 |    1993.3 |  2034.21 |   21.91 |
| e_coli_pb      | k15 |  35264.41 | 10717.47 |  262.36 |
| e_coli_pb      | k17 |  16391.14 | 11079.68 | 4103.34 |
| s_cerevisiae   | k13 |    257.59 |   657.64 |   22.02 |


## PconBr on simulated dataset evaluation


The error rate is taken from the `error rate:` line of the `samtools stats` output.
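For reference, the `error rate:` value sits in the summary-numbers (`SN`) section of a `samtools stats` report as a tab-separated line. The sketch below shows one way to pull it out of the report text. It is only an illustration: the function name `parse_error_rate` is hypothetical and this is not necessarily how `pconbr.identity.get_error_rate` works.

```python
def parse_error_rate(stats_text):
    """Return the error rate (a fraction, not a percentage) from the text
    of a `samtools stats` report.

    The value is read from the summary line of the form
    "SN<TAB>error rate:<TAB><value><TAB># comment".
    """
    for line in stats_text.splitlines():
        fields = line.split("\t")
        if len(fields) >= 3 and fields[0] == "SN" and fields[1] == "error rate:":
            return float(fields[2])
    raise ValueError("no 'error rate:' line found in samtools stats output")


# Made-up report fragment, for illustration only.
sample = (
    "SN\tbases mapped (cigar):\t1000\n"
    "SN\terror rate:\t1.250000e-02\t# mismatches / bases mapped (cigar)\n"
)
error_rate_percent = parse_error_rate(sample) * 100
```

Multiplying by 100, as the notebook cells do, converts the fraction into a percentage.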

Reads were simulated with [Badread](https://github.com/rrwick/Badread) from the E. coli CFT073 genome ([ENA id CP028309](https://www.ebi.ac.uk/ena/data/view/CP028309)), with an error rate of 5.625682.

We evaluate identity before and after the pconbr pipeline for different values of the kmer size (k) and different br methods.

In [7]:
# Run the snakemake pipeline that tests the parameters on the dataset
#!snakemake -s pipeline/all.snakefile pconbr_eval

### Synthetic dataset 95

Error rate:

In [3]:
float(pconbr.identity.get_error_rate("reads/simulated_reads_95.stats"))*100

FileNotFoundError: [Errno 2] No such file or directory: 'reads/simulated_reads_95.stats'

#### Kmer spectrum

In [4]:
#!snakemake -s pipeline/generate_stat.snakefile kmer_spectrum_simulated_reads
import pandas
import plotly.graph_objects as go

df_true_false = pandas.read_csv("stats/kmer_spectrum/simulated_reads_95_true_false.csv", index_col=0)
df_all = pandas.read_csv("stats/kmer_spectrum/simulated_reads_95.csv", index_col=0)

df = pandas.merge(df_true_false, df_all, left_index=True, right_index=True)


fig = go.Figure(data=[go.Bar(name=c, y=df[c]) for c in df.columns],
               layout=go.Layout(yaxis=dict(range=[0, 275_000]),
                                xaxis=dict(range=[0, 125]))
               )

#fig.update_layout(yaxis_type="log")
fig.show()

ModuleNotFoundError: No module named 'plotly'
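A kmer spectrum is the histogram of kmer abundances: for each count value, how many distinct kmers occur that many times in the reads. The sketch below computes one in pure Python for small inputs. It is only an illustration of the concept, not the pipeline's implementation; it counts forward-strand kmers only (whether the project's counter uses canonical kmers is not stated here), and the function name `kmer_spectrum` is hypothetical.

```python
from collections import Counter


def kmer_spectrum(reads, k):
    """Return a Counter mapping abundance -> number of distinct kmers.

    Assumption: kmers are counted on the forward strand only.
    """
    counts = Counter()
    for read in reads:
        # Slide a window of length k over the read.
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    # Histogram of the counts themselves.
    return Counter(counts.values())


spectrum = kmer_spectrum(["ACGTACGT"], k=4)
# ACGT occurs twice; CGTA, GTAC and TACG occur once each.
```

In real read sets, kmers containing sequencing errors tend to pile up at low abundance, while genomic kmers cluster around the sequencing coverage.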

#### Correction with genomic kmer

Difference between original error rate and the corrected read error rate

In [9]:
pconbr.identity.genomic_kmer("simulated_reads_95")

| k  |       s1 |       s2 |       s3 |       s4 |       s5 |       s6 |       s7 |       s8 |       s9 |
|---:|---------:|---------:|---------:|---------:|---------:|---------:|---------:|---------:|---------:|
|  9 |  0.00963 |  0.00922 |  0.00889 |  0.00875 |  0.00875 |  0.00875 |  0.00875 |  0.00874 |  0.00874 |
| 11 |  1.39132 |  1.16851 |  1.01168 |  0.90589 |  0.83365 |  0.78273 |  0.74338 |  0.71065 |  0.67966 |
| 13 | 11.67877 |   5.0777 |  2.00401 |  0.65655 |  0.06778 | -0.18413 | -0.28398 | -0.31269 | -0.30837 |
| 15 |  0.56856 |  -1.4566 | -1.56729 | -1.36834 | -1.14115 | -0.94642 | -0.78921 | -0.66178 | -0.55848 |
| 17 | -2.95722 | -2.11007 | -1.61401 | -1.27031 | -1.01657 | -0.82709 | -0.68288 | -0.56939 | -0.47824 |
| 19 | -0.01686 | -0.00302 | -0.00054 | -0.00014 |   -7e-05 |   -5e-05 |   -4e-05 |   -3e-05 |   -2e-05 |


#### Correction with reads kmer

Difference between original error rate and the corrected read error rate

In [10]:
pconbr.identity.read_kmer("simulated_reads_95")

| a | k  |      s1 |       s2 |       s3 |       s4 |       s5 |       s6 |       s7 |       s8 |       s9 |
|--:|---:|--------:|---------:|---------:|---------:|---------:|---------:|---------:|---------:|---------:|
| 1 | 13 |   1e-05 |    1e-05 |    1e-05 |    1e-05 |    1e-05 |    1e-05 |    1e-05 |    1e-05 |      0.0 |
| 1 | 15 | 0.00067 |  0.00076 |  0.00073 |  0.00061 |   0.0005 |  0.00041 |  0.00031 |  0.00023 |  0.00017 |
| 1 | 17 | 0.00076 |  0.00056 |  0.00015 |   -6e-05 | -0.00014 | -0.00015 | -0.00014 | -0.00012 |   -9e-05 |
| 1 | 19 |  -3e-05 | -0.00031 | -0.00036 | -0.00033 | -0.00029 | -0.00025 | -0.00021 | -0.00017 | -0.00013 |
| 2 | 13 | 0.04044 |  0.03584 |  0.03267 |  0.03075 |  0.02955 |  0.02883 |  0.02846 |  0.02826 |  0.02814 |
| 2 | 15 | 6.55711 |  3.31451 |  1.71992 |   0.9609 |  0.57859 |  0.36893 |  0.24615 |  0.16933 |  0.11925 |
| 2 | 17 | 0.42643 | -0.39124 | -0.49705 | -0.45951 |  -0.3987 | -0.34152 | -0.29237 | -0.25137 | -0.21694 |
| 2 | 19 | -0.8778 |  -0.7497 | -0.60991 | -0.49869 |  -0.4114 | -0.34309 | -0.28894 |  -0.2451 | -0.20903 |
| 3 | 13 | 0.15975 |   0.1376 |  0.12353 |  0.11487 |  0.10957 |   0.1064 |  0.10444 |   0.1033 |  0.10251 |
| 3 | 15 | 6.09556 |    2.709 |  1.20231 |  0.54095 |  0.23492 |  0.08442 |  0.00693 | -0.03386 | -0.05533 |


### Synthetic dataset 96

Error rate:

In [3]:
float(pconbr.identity.get_error_rate("reads/simulated_reads_96.stats"))*100

FileNotFoundError: [Errno 2] No such file or directory: 'reads/simulated_reads_95.stats'

#### Kmer spectrum

In [4]:
#!snakemake -s pipeline/generate_stat.snakefile kmer_spectrum_simulated_reads
import pandas
import plotly.graph_objects as go

df_true_false = pandas.read_csv("stats/kmer_spectrum/simulated_reads_95_true_false.csv", index_col=0)
df_all = pandas.read_csv("stats/kmer_spectrum/simulated_reads_95.csv", index_col=0)

df = pandas.merge(df_true_false, df_all, left_index=True, right_index=True)


fig = go.Figure(data=[go.Bar(name=c, y=df[c]) for c in df.columns],
               layout=go.Layout(yaxis=dict(range=[0, 275_000]),
                                xaxis=dict(range=[0, 125]))
               )

#fig.update_layout(yaxis_type="log")
fig.show()

ModuleNotFoundError: No module named 'plotly'

#### Correction with genomic kmer

Difference between original error rate and the corrected read error rate

In [9]:
pconbr.identity.genomic_kmer("simulated_reads_95")

| k  |       s1 |       s2 |       s3 |       s4 |       s5 |       s6 |       s7 |       s8 |       s9 |
|---:|---------:|---------:|---------:|---------:|---------:|---------:|---------:|---------:|---------:|
|  9 |  0.00963 |  0.00922 |  0.00889 |  0.00875 |  0.00875 |  0.00875 |  0.00875 |  0.00874 |  0.00874 |
| 11 |  1.39132 |  1.16851 |  1.01168 |  0.90589 |  0.83365 |  0.78273 |  0.74338 |  0.71065 |  0.67966 |
| 13 | 11.67877 |   5.0777 |  2.00401 |  0.65655 |  0.06778 | -0.18413 | -0.28398 | -0.31269 | -0.30837 |
| 15 |  0.56856 |  -1.4566 | -1.56729 | -1.36834 | -1.14115 | -0.94642 | -0.78921 | -0.66178 | -0.55848 |
| 17 | -2.95722 | -2.11007 | -1.61401 | -1.27031 | -1.01657 | -0.82709 | -0.68288 | -0.56939 | -0.47824 |
| 19 | -0.01686 | -0.00302 | -0.00054 | -0.00014 |   -7e-05 |   -5e-05 |   -4e-05 |   -3e-05 |   -2e-05 |


#### Correction with reads kmer

Difference between original error rate and the corrected read error rate

In [10]:
pconbr.identity.read_kmer("simulated_reads_95")

| a | k  |      s1 |       s2 |       s3 |       s4 |       s5 |       s6 |       s7 |       s8 |       s9 |
|--:|---:|--------:|---------:|---------:|---------:|---------:|---------:|---------:|---------:|---------:|
| 1 | 13 |   1e-05 |    1e-05 |    1e-05 |    1e-05 |    1e-05 |    1e-05 |    1e-05 |    1e-05 |      0.0 |
| 1 | 15 | 0.00067 |  0.00076 |  0.00073 |  0.00061 |   0.0005 |  0.00041 |  0.00031 |  0.00023 |  0.00017 |
| 1 | 17 | 0.00076 |  0.00056 |  0.00015 |   -6e-05 | -0.00014 | -0.00015 | -0.00014 | -0.00012 |   -9e-05 |
| 1 | 19 |  -3e-05 | -0.00031 | -0.00036 | -0.00033 | -0.00029 | -0.00025 | -0.00021 | -0.00017 | -0.00013 |
| 2 | 13 | 0.04044 |  0.03584 |  0.03267 |  0.03075 |  0.02955 |  0.02883 |  0.02846 |  0.02826 |  0.02814 |
| 2 | 15 | 6.55711 |  3.31451 |  1.71992 |   0.9609 |  0.57859 |  0.36893 |  0.24615 |  0.16933 |  0.11925 |
| 2 | 17 | 0.42643 | -0.39124 | -0.49705 | -0.45951 |  -0.3987 | -0.34152 | -0.29237 | -0.25137 | -0.21694 |
| 2 | 19 | -0.8778 |  -0.7497 | -0.60991 | -0.49869 |  -0.4114 | -0.34309 | -0.28894 |  -0.2451 | -0.20903 |
| 3 | 13 | 0.15975 |   0.1376 |  0.12353 |  0.11487 |  0.10957 |   0.1064 |  0.10444 |   0.1033 |  0.10251 |
| 3 | 15 | 6.09556 |    2.709 |  1.20231 |  0.54095 |  0.23492 |  0.08442 |  0.00693 | -0.03386 | -0.05533 |


## Long read correction

To evaluate our correction against other tools, we use:
- [ELECTOR](//doi.org/10.1101/512889) to compare corrected reads against the reference genome
- [QUAST](//doi.org/10.1093/bioinformatics/bty266) to assess the assembly results (redbean, rala, flye)

### Self correction

We compare pconbr against other self-correction tools.

| Tool name  | Reference                                                                |
|:-----------|:-------------------------------------------------------------------------|
| CONSENT    | [10.1101/546630](//doi.org/10.1101/546630)                               |
| daccord    | [10.1101/106252](//doi.org/10.1101/106252)                               |
| FLAS       | [10.1093/bioinformatics/btz206](//doi.org/10.1093/bioinformatics/btz206) |
| MECAT      | [10.1038/nmeth.4432](//doi.org/10.1038/nmeth.4432)                       |

#### Mapping result

In [28]:
# `display` is a function here; Markdown lives in the IPython.display module
from IPython.display import Markdown
Markdown("TODO")

#### Assembly result

In [None]:
from IPython.display import Markdown
Markdown("TODO")

### Hybrid correction

We compare pconbr against other hybrid correction tools.

| Tool name  | Reference                                                                |
|:-----------|:-------------------------------------------------------------------------|


#### Mapping result

In [None]:
from IPython.display import Markdown
Markdown("TODO")

#### Assembly result

In [None]:
from IPython.display import Markdown
Markdown("TODO")

## Polishing

