This tool was developed to identify and quantify the occurrence of single nucleotide variants, insertions, deletions and duplications in RNA-seq data. Unlike most tools, which try to report all variants across a complete genome, km focuses the analysis on small regions of interest.
Given a reference sequence (typically a few hundred base pairs) around a known or suspected mutation in a gene of interest, km reports all possible sequences that can be created between the two end k-mers according to the sequenced reads. A ratio of variant allele to wild type (WT) is computed for each possible sequence constructed.
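To make the notion of k-mers and of the two end k-mers concrete, here is a minimal Python sketch. It is not km's code: the helper name `kmers`, the toy sequence and k = 5 are assumptions chosen for readability (real targets are a few hundred base pairs and use a larger k).

```python
def kmers(sequence, k):
    """Return the overlapping k-mers of `sequence`, in order."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


# Hypothetical toy target, far shorter than a real target sequence.
target = "AATTGCTTCCGG"
words = kmers(target, 5)

# The first and last k-mers are the two "end k-mers" that every
# reported sequence has to connect.
first_kmer, last_kmer = words[0], words[-1]
print(first_kmer, last_kmer, len(words))  # AATTG TCCGG 8
```

A sequence of length n yields n - k + 1 overlapping k-mers, which is why the toy target of 12 bases gives 8 words here.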
- Targeted variant detection using unaligned RNA-Seq reads. Life Science Alliance 2019 Aug 19;2(4); doi: https://doi.org/10.26508/lsa.201900336
- Target variant detection in leukemia using unaligned RNA-Seq reads. bioRxiv 295808; doi: https://doi.org/10.1101/295808
python3 -m venv $HOME/.virtualenvs/km
source $HOME/.virtualenvs/km/bin/activate
pip install --upgrade pip setuptools wheel
pip install km-walk
- Python 3.8.0 or later with pip installed.
- The virtual environment needs to be loaded each time you open a new terminal, with this command:
$ source $HOME/.virtualenvs/km/bin/activate
- 4bp insertion in NPM1
$ cd [your_km_folder]
$ km find_mutation ./data/catalog/GRCh38/NPM1_4ins_exons_10-11utr.fa ./data/jf/02H025_NPM1.jf | km find_report -t ./data/catalog/GRCh38/NPM1_4ins_exons_10-11utr.fa
Sample Region Location Type Removed Added Abnormal Normal Ratio Min_coverage Exclu_min_cov Variant Target Info Variant_sequence Reference_sequence
./data/jf/02H025_NPM1.jf chr5:171410540-171410543 chr5:171410544 ITD 0 4 | 4 2870.6 3055.2 0.484 2428 /TCTG NPM1_4ins_exons_10-11utr vs_ref AATTGCTTCCGGATGACTGACCAAGAGGCTATTCAAGATCTCTGTCTGGCAGTGGAGGAAGTCTCTTTAAGAAAATAGTTTAAA AATTGCTTCCGGATGACTGACCAAGAGGCTATTCAAGATCTCTGGCAGTGGAGGAAGTCTCTTTAAGAAAATAGTTTAAA
./data/jf/02H025_NPM1.jf - Reference 0 0 0.0 2379.0 1.000 2379 - NPM1_4ins_exons_10-11utr vs_ref
# To display kmer coverage
$ km find_mutation ./data/catalog/GRCh38/NPM1_4ins_exons_10-11utr.fa ./data/jf/02H025_NPM1.jf -g
- ITD of 75 bp
$ cd [your_km_folder]
$ km find_mutation ./data/catalog/GRCh38/FLT3-ITD_exons_13-15.fa ./data/jf/03H116_ITD.jf | km find_report -t ./data/catalog/GRCh38/FLT3-ITD_exons_13-15.fa
Sample Region Location Type Removed Added Abnormal Normal Ratio Min_coverage Exclu_min_cov Variant Target Info Variant_sequence Reference_sequence
./data/jf/03H116_ITD.jf - Reference 0 0 0.0 443.0 1.000 912 - FLT3-ITD_exons_13-15 vs_ref
./data/jf/03H116_ITD.jf chr13:28034105-28034179 chr13:28034180 ITD 0 75 | 75 417.6 1096.7 0.276 443 /AACTCCCATTTGAGATCATATTCATATTCTCTGAAATCAACGTAGAAGTACTCATTATCTGAGGAGCCGGTCACC FLT3-ITD_exons_13-15 vs_ref CTTTCAGCATTTTGACGGCAACCTGGATTGAGACTCCTGTTTTGCTAATTCCATAAGCTGTTGCGTTCATCACTTTTCCAAAAGCACCTGATCCTAGTACCTTCCCAAACTCTAAATTTTCTCTTGGAAACTCCCATTTGAGATCATATTCATATTCTCTGAAATCAACGTAGAAGTACTCATTATCTGAGGAGCCGGTCACCAACTCCCATTTGAGATCATATTCATATTCTCTGAAATCAACGTAGAAGTACTCATTATCTGAGGAGCCGGTCACCTGTACCATCTGTAGCTGGCTTTCATACCTAAATTGCTTTTTGTACTTGTGACAAATTAGCAGGGTTAAAACGACAATGAAGAGGAGACAAACACCAATTGTTGCATAGAATGAGATGTTGTCTTGGATGAAAGGGAAGGGGC CTTTCAGCATTTTGACGGCAACCTGGATTGAGACTCCTGTTTTGCTAATTCCATAAGCTGTTGCGTTCATCACTTTTCCAAAAGCACCTGATCCTAGTACCTTCCCAAACTCTAAATTTTCTCTTGGAAACTCCCATTTGAGATCATATTCATATTCTCTGAAATCAACGTAGAAGTACTCATTATCTGAGGAGCCGGTCACCTGTACCATCTGTAGCTGGCTTTCATACCTAAATTGCTTTTTGTACTTGTGACAAATTAGCAGGGTTAAAACGACAATGAAGAGGAGACAAACACCAATTGTTGCATAGAATGAGATGTTGTCTTGGATGAAAGGGAAGGGGC
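In the two variant rows shown in these examples, the Ratio column agrees with Abnormal / (Abnormal + Normal). The Python sketch below only checks that arithmetic on the printed values. It is an observation about these rows, not km's documented formula, and the function name `variant_ratio` is an assumption. The Reference rows report 1.000, so they evidently follow a different convention.

```python
def variant_ratio(abnormal, normal):
    """Fraction of the total coverage attributed to the variant path."""
    total = abnormal + normal
    return abnormal / total if total else 0.0


# Abnormal and Normal values copied from the example reports above.
print(round(variant_ratio(2870.6, 3055.2), 3))  # NPM1 4 bp insertion: 0.484
print(round(variant_ratio(417.6, 1096.7), 3))   # FLT3 75 bp ITD: 0.276
```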
Alternatively, you can run easy_install.sh, which installs km in a virtual environment and tests it as shown above. Run as is, the script installs km in a virtual environment in: $HOME/.virtualenvs/km.
./easy_install.sh
km can be executed directly from source code.
- Python 3.6.0 or later
- Jellyfish 2.2 or later with Python bindings (or the pyJellyfish module).
$ cd [your_km_folder]
$ python -m km find_mutation ./data/catalog/GRCh38/NPM1_4ins_exons_10-11utr.fa ./data/jf/02H025_NPM1.jf | python -m km find_report -t ./data/catalog/GRCh38/NPM1_4ins_exons_10-11utr.fa
- km performs targeted analysis based on target sequences. These target sequences must be designed and given to km as input.
- A target sequence is a nucleotide sequence saved in a FASTA file. Some target sequences are provided in the catalog.
- To fit your specific needs, you will have to create your own target sequences.
- For generic cases, you can follow the good practices described below:
- A web portal is available to assist you in the creation of your target sequences (for cases 1 and 2).
- km-target: https://bioinfo.iric.ca/km-target/
- You can also extract nucleotide sequences from a genome using several methods, two of which are described below:
$ km -h
usage: PROG [-h] {find_mutation,find_report,linear_kmin,min_cov} ...
positional arguments:
{find_mutation,find_report,linear_kmin,min_cov}
sub-command help
find_mutation Identify and quantify mutations from a target sequence
and a k-mer database.
find_report Parse find_mutation output to reformat it in tabulated
file more user friendly.
linear_kmin Find min k length to decompose a target sequence in a
linear graph.
min_cov Compute coverage of target sequences.
optional arguments:
-h, --help show this help message and exit
For more detailed documentation click here.
This is the main tool of km. It identifies and quantifies mutations from a target sequence and a Jellyfish k-mer database.
$ km find_mutation -h
$ km find_mutation [your_fasta_targetSeq] [your_jellyfish_count_table]
$ km find_mutation [your_catalog_directory] [your_jellyfish_count_table]
This tool parses find_mutation output and reformats it as a more user-friendly tabulated file.
$ km find_report -h
$ km find_report -t [your_fasta_targetSeq] [find_mutation_output]
$ km find_mutation [your_fasta_targetSeq] [your_jellyfish_count_table] | km find_report -t [your_fasta_targetSeq]
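The same pipe can be driven from Python, which may be convenient when looping over many samples. This is a sketch, assuming `km` is on the PATH (for example inside the activated virtual environment). The function names `build_pipeline` and `run_pipeline` are hypothetical, and only the sub-commands and the `-t` option shown above are used.

```python
import subprocess


def build_pipeline(target_fasta, jf_table):
    """Return the two argument lists of the find_mutation | find_report pipe."""
    find_mutation = ["km", "find_mutation", target_fasta, jf_table]
    find_report = ["km", "find_report", "-t", target_fasta]
    return find_mutation, find_report


def run_pipeline(target_fasta, jf_table):
    """Run the pipe and return the find_report output as text."""
    find_mutation, find_report = build_pipeline(target_fasta, jf_table)
    producer = subprocess.Popen(find_mutation, stdout=subprocess.PIPE)
    # find_report reads find_mutation's output on its standard input.
    consumer = subprocess.run(
        find_report,
        stdin=producer.stdout,
        stdout=subprocess.PIPE,
        text=True,
        check=True,
    )
    producer.stdout.close()
    if producer.wait() != 0:
        raise subprocess.CalledProcessError(producer.returncode, find_mutation)
    return consumer.stdout


# Example call (requires km and the example data to be available):
# report = run_pipeline(
#     "./data/catalog/GRCh38/NPM1_4ins_exons_10-11utr.fa",
#     "./data/jf/02H025_NPM1.jf",
# )
```

Passing argument lists rather than a shell string avoids quoting problems with unusual file names.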
This tool displays k-mer coverage statistics for a target sequence across a list of Jellyfish databases.
$ km min_cov -h
$ km min_cov [your_fasta_targetSeq] [[your_jellyfish_count_table]...]
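As a rough illustration of what a k-mer coverage statistic over a target means, the Python sketch below looks up each k-mer of a target in a count table and summarizes the counts. It does not reproduce min_cov or its output format. A plain dictionary stands in for the Jellyfish count table, and the function name `coverage_stats`, the toy counts and k = 3 are all assumptions.

```python
from statistics import mean


def coverage_stats(target, counts, k):
    """Summarize the counts of the target's k-mers (0 when a k-mer is absent)."""
    per_kmer = [
        counts.get(target[i:i + k], 0)
        for i in range(len(target) - k + 1)
    ]
    return {"min": min(per_kmer), "max": max(per_kmer), "mean": mean(per_kmer)}


# Hypothetical toy count table; a real analysis would query a Jellyfish database.
toy_counts = {"ACG": 10, "CGT": 12, "GTA": 4}
print(coverage_stats("ACGTAC", toy_counts, 3))
# {'min': 0, 'max': 12, 'mean': 6.5}
```

A minimum of 0 means at least one k-mer of the target is missing from the sample, so the reference path cannot be followed end to end in that sample.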
The k-mer length is a central parameter:
- To produce a linear directed graph from the target sequence.
- To avoid false positives. find_mutation should not be used on Jellyfish count tables built with k < 21 bp (we recommend k = 31 bp by default).
The linear_kmin tool gives you the minimum k length that allows a target sequence to be decomposed into a linear graph.
$ km linear_kmin -h
$ km linear_kmin [your_catalog_directory]
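The idea of a minimum k for a linear graph can be sketched in Python as follows. This is not linear_kmin's implementation. It assumes that "linear" means no k-mer of the target repeats and no (k-1)-base overlap is shared by more than one k-mer on either side. The names `is_linear` and `smallest_linear_k` are hypothetical.

```python
from collections import Counter


def is_linear(sequence, k):
    """True if the k-mers of `sequence` form a single unbranched chain."""
    kmers = [sequence[i:i + k] for i in range(len(sequence) - k + 1)]
    if len(set(kmers)) != len(kmers):
        return False  # a repeated k-mer makes the path revisit a node
    prefixes = Counter(kmer[:-1] for kmer in kmers)
    suffixes = Counter(kmer[1:] for kmer in kmers)
    # Each k-mer may overlap at most one following and one preceding k-mer.
    return all(
        prefixes[kmer[1:]] <= 1 and suffixes[kmer[:-1]] <= 1
        for kmer in kmers
    )


def smallest_linear_k(sequence, start=2):
    """Return the smallest k >= start giving a linear graph, or None."""
    for k in range(start, len(sequence) + 1):
        if is_linear(sequence, k):
            return k
    return None


# Toy sequence with a repeated "ACG": k = 4 still branches, k = 5 does not.
print(smallest_linear_k("ACGTACGA"))  # 5
```

Repeats in the target push the minimum k up, which is one reason a target containing a long internal repeat may need a larger k than usual.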
The example folder contains a script to help you run a km analysis on one Leucegene sample.