km : a software for RNA-seq investigation using k-mer decomposition


Introduction:

This tool was developed to identify and quantify the occurrence of single nucleotide variants, insertions, deletions and duplications in RNA-seq data. Unlike most tools, which try to report all variants across a complete genome, km focuses the analysis on small regions of interest.

Given a reference sequence (typically a few hundred base pairs) around a known or suspected mutation in a gene of interest, km reports all possible sequences that can be created between the two end k-mers according to the sequenced reads. A ratio of variant allele to wild type (WT) is computed for each sequence constructed.
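To make the idea of "all sequences between the two end k-mers" concrete, here is a minimal Python sketch. It is not km's implementation: count_kmers is a toy stand-in for a jellyfish count table, and walk_paths, min_count and max_extra are hypothetical names chosen for this illustration. The walk starts at the first k-mer of the reference, extends one base at a time as long as the next k-mer is supported by the reads, and keeps every path that reaches the last k-mer.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer in the reads (toy stand-in for a jellyfish table)."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def walk_paths(reference, counts, k, min_count=1, max_extra=10):
    """Return all sequences linking the first and last k-mer of the reference.

    min_count and max_extra are illustrative parameters: a k-mer is followed
    only if it was seen at least min_count times, and a path is abandoned
    once it is max_extra bases longer than the reference.
    """
    first, last = reference[:k], reference[-k:]
    max_len = len(reference) + max_extra
    found = []
    stack = [first]
    while stack:
        seq = stack.pop()
        tail = seq[-k:]
        if tail == last:
            found.append(seq)  # reached the end k-mer: keep this path
            continue
        if len(seq) >= max_len:
            continue  # path grew too long: give up on it
        for base in "ACGT":
            if counts.get(tail[1:] + base, 0) >= min_count:
                stack.append(seq + base)
    return sorted(found)


reference = "ACGTTGCA"
reads = ["ACGTTGCA", "ACGTATGCA"]  # one wild-type read, one with an inserted A
counts = count_kmers(reads, 4)
paths = walk_paths(reference, counts, 4)
print(paths)
```

With these toy reads the walk finds two paths: the reference itself and the sequence carrying the one-base insertion.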

Citing:

Install:

Using pip (recommended)

python3 -m venv $HOME/.virtualenvs/km
source $HOME/.virtualenvs/km/bin/activate
pip install --upgrade pip setuptools wheel
pip install km-walk

Requirements:

  • Python 3.8.0 or later with pip installed.

Usage:

  • The virtual environment needs to be loaded each time you open a new terminal, with this command:
$ source $HOME/.virtualenvs/km/bin/activate

Test:

  • 4bp insertion in NPM1
$ cd [your_km_folder]
$ km find_mutation ./data/catalog/GRCh38/NPM1_4ins_exons_10-11utr.fa ./data/jf/02H025_NPM1.jf | km find_report -t ./data/catalog/GRCh38/NPM1_4ins_exons_10-11utr.fa
Sample        Region  Location        Type    Removed Added   Abnormal        Normal  Ratio   Min_coverage    Exclu_min_cov   Variant Target  Info    Variant_sequence        Reference_sequence
./data/jf/02H025_NPM1.jf      chr5:171410540-171410543        chr5:171410544  ITD     0       4 | 4   2870.6  3055.2  0.484   2428            /TCTG   NPM1_4ins_exons_10-11utr        vs_ref  AATTGCTTCCGGATGACTGACCAAGAGGCTATTCAAGATCTCTGTCTGGCAGTGGAGGAAGTCTCTTTAAGAAAATAGTTTAAA    AATTGCTTCCGGATGACTGACCAAGAGGCTATTCAAGATCTCTGGCAGTGGAGGAAGTCTCTTTAAGAAAATAGTTTAAA
./data/jf/02H025_NPM1.jf              -       Reference       0       0       0.0     2379.0  1.000   2379            -       NPM1_4ins_exons_10-11utr        vs_ref
# To display kmer coverage
$ km find_mutation ./data/catalog/GRCh38/NPM1_4ins_exons_10-11utr.fa ./data/jf/02H025_NPM1.jf -g
  • ITD of 75 bp
$ cd [your_km_folder]
$ km find_mutation ./data/catalog/GRCh38/FLT3-ITD_exons_13-15.fa ./data/jf/03H116_ITD.jf | km find_report -t ./data/catalog/GRCh38/FLT3-ITD_exons_13-15.fa
Sample        Region  Location        Type    Removed Added   Abnormal        Normal  Ratio   Min_coverage    Exclu_min_cov   Variant Target  Info    Variant_sequence        Reference_sequence
./data/jf/03H116_ITD.jf               -       Reference       0       0       0.0     443.0   1.000   912             -       FLT3-ITD_exons_13-15    vs_ref
./data/jf/03H116_ITD.jf       chr13:28034105-28034179 chr13:28034180  ITD     0       75 | 75 417.6   1096.7  0.276   443             /AACTCCCATTTGAGATCATATTCATATTCTCTGAAATCAACGTAGAAGTACTCATTATCTGAGGAGCCGGTCACC    FLT3-ITD_exons_13-15    vs_ref  CTTTCAGCATTTTGACGGCAACCTGGATTGAGACTCCTGTTTTGCTAATTCCATAAGCTGTTGCGTTCATCACTTTTCCAAAAGCACCTGATCCTAGTACCTTCCCAAACTCTAAATTTTCTCTTGGAAACTCCCATTTGAGATCATATTCATATTCTCTGAAATCAACGTAGAAGTACTCATTATCTGAGGAGCCGGTCACCAACTCCCATTTGAGATCATATTCATATTCTCTGAAATCAACGTAGAAGTACTCATTATCTGAGGAGCCGGTCACCTGTACCATCTGTAGCTGGCTTTCATACCTAAATTGCTTTTTGTACTTGTGACAAATTAGCAGGGTTAAAACGACAATGAAGAGGAGACAAACACCAATTGTTGCATAGAATGAGATGTTGTCTTGGATGAAAGGGAAGGGGC    CTTTCAGCATTTTGACGGCAACCTGGATTGAGACTCCTGTTTTGCTAATTCCATAAGCTGTTGCGTTCATCACTTTTCCAAAAGCACCTGATCCTAGTACCTTCCCAAACTCTAAATTTTCTCTTGGAAACTCCCATTTGAGATCATATTCATATTCTCTGAAATCAACGTAGAAGTACTCATTATCTGAGGAGCCGGTCACCTGTACCATCTGTAGCTGGCTTTCATACCTAAATTGCTTTTTGTACTTGTGACAAATTAGCAGGGTTAAAACGACAATGAAGAGGAGACAAACACCAATTGTTGCATAGAATGAGATGTTGTCTTGGATGAAAGGGAAGGGGC
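The find_report output is a tab-separated table, so it can be post-processed with standard tools. The Python sketch below rebuilds the NPM1 example (sequence columns omitted for brevity), keeps the non-reference rows and recomputes the ratio. In both example outputs, the Ratio of the variant row matches Abnormal / (Abnormal + Normal) after rounding; whether km computes it exactly this way is an assumption. The helper names variant_rows and recomputed_ratio are hypothetical.

```python
import csv
import io

COLUMNS = ["Sample", "Region", "Location", "Type", "Removed", "Added",
           "Abnormal", "Normal", "Ratio", "Min_coverage", "Exclu_min_cov",
           "Variant", "Target", "Info"]

# Values copied from the NPM1 test output (sequence columns left out).
ROWS = [
    ["./data/jf/02H025_NPM1.jf", "chr5:171410540-171410543", "chr5:171410544",
     "ITD", "0", "4 | 4", "2870.6", "3055.2", "0.484", "2428", "", "/TCTG",
     "NPM1_4ins_exons_10-11utr", "vs_ref"],
    ["./data/jf/02H025_NPM1.jf", "", "-", "Reference", "0", "0", "0.0",
     "2379.0", "1.000", "2379", "", "-", "NPM1_4ins_exons_10-11utr", "vs_ref"],
]

report_text = "\n".join("\t".join(row) for row in [COLUMNS] + ROWS) + "\n"


def variant_rows(text):
    """Parse a find_report table and keep only the non-reference rows."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return [row for row in reader if row["Type"] != "Reference"]


def recomputed_ratio(row):
    """Assumed formula: abnormal / (abnormal + normal), rounded to 3 digits."""
    abnormal = float(row["Abnormal"])
    normal = float(row["Normal"])
    return round(abnormal / (abnormal + normal), 3)


variants = variant_rows(report_text)
for row in variants:
    print(row["Target"], row["Type"], row["Variant"], recomputed_ratio(row))
```

The same check on the FLT3-ITD row gives 417.6 / (417.6 + 1096.7), which rounds to the reported 0.276.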

Bootstrap:

Alternatively, you can run easy_install.sh, which installs km in a virtual environment and tests it as shown above. Run as is, the script installs km in a virtual environment at $HOME/.virtualenvs/km.

./easy_install.sh

From source code

km can be executed directly from source code.

Requirements:

  • Python 3.6.0 or later
  • Jellyfish 2.2 or later with Python bindings (or pyJellyfish module).

Usage:

$ cd [your_km_folder]
$ python -m km find_mutation ./data/catalog/GRCh38/NPM1_4ins_exons_10-11utr.fa ./data/jf/02H025_NPM1.jf | python -m km find_report -t ./data/catalog/GRCh38/NPM1_4ins_exons_10-11utr.fa

Design your target sequence:

  • km is designed for targeted analysis based on target sequences. These target sequences need to be designed and given to km as input.
  • A target sequence is a nucleotide sequence saved in a fasta file. Some target sequences are provided in the catalog.
  • To fit your specific needs, you will have to create your own target sequences.
  • For generic cases, you can follow the good practices described below:

https://raw.githubusercontent.com/iric-soft/km/master/data/figure/doc_target_sequence.png

  • A web portal is available to assist you in the creation of your target sequences (for cases 1 and 2).
  • You can also extract nucleotide sequences from a genome using several methods, two of which are described below:
    • Using samtools: samtools faidx GRCh38/genome.fa chr2:25234341-25234405
    • Using Get DNA from UCSC.
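Region strings such as chr2:25234341-25234405 use 1-based, inclusive coordinates in samtools, which is easy to get wrong when slicing sequences by hand. The Python sketch below shows the conversion on a toy sequence; parse_region and extract_region are hypothetical helpers written for this illustration, not part of km or samtools.

```python
def parse_region(region):
    """Split 'chrom:start-end' into (chrom, start, end), 1-based inclusive."""
    chrom, span = region.rsplit(":", 1)
    start, end = (int(value.replace(",", "")) for value in span.split("-"))
    return chrom, start, end


def extract_region(sequence, start, end):
    """Return the bases from start to end (1-based, both ends included)."""
    if not 1 <= start <= end <= len(sequence):
        raise ValueError("region falls outside the sequence")
    return sequence[start - 1:end]


chrom, start, end = parse_region("chr2:25234341-25234405")
print(chrom, end - start + 1)                # length of the requested region
print(extract_region("ACGTACGTAC", 3, 6))    # toy chromosome, bases 3 to 6
```

The region used in the samtools example above spans 65 bases, since both end positions are included.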

Display help:

$ km -h
  usage: PROG [-h] {find_mutation,find_report,linear_kmin,min_cov} ...

  positional arguments:
    {find_mutation,find_report,linear_kmin,min_cov}
                          sub-command help
      find_mutation       Identify and quantify mutations from a target sequence
                          and a k-mer database.
      find_report         Parse find_mutation output to reformat it in tabulated
                          file more user friendly.
      linear_kmin         Find min k length to decompose a target sequence in a
                          linear graph.
      min_cov             Compute coverage of target sequences.

  optional arguments:
    -h, --help            show this help message and exit

km's tools overview:

For more detailed documentation click here.

find_mutation:

This is the main tool of km. It identifies and quantifies mutations from a target sequence and a jellyfish k-mer database.

$ km find_mutation -h
$ km find_mutation [your_fasta_targetSeq] [your_jellyfish_count_table]
$ km find_mutation [your_catalog_directory] [your_jellyfish_count_table]

find_report:

This tool parses find_mutation output and reformats it into a more user-friendly tabulated file.

$ km find_report -h
$ km find_report -t [your_fasta_targetSeq] [find_mutation_output]
$ km find_mutation [your_fasta_targetSeq] [your_jellyfish_count_table] | km find_report -t [your_fasta_targetSeq]
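When km is called from a script rather than a terminal, the same two-step pipeline can be driven from Python. The sketch below only uses the command forms shown above and assumes the km executable is on the PATH; build_commands and run_pipeline are hypothetical helper names.

```python
import subprocess


def build_commands(target_fasta, jellyfish_db):
    """Build the argument lists for find_mutation and find_report."""
    find_cmd = ["km", "find_mutation", target_fasta, jellyfish_db]
    report_cmd = ["km", "find_report", "-t", target_fasta]
    return find_cmd, report_cmd


def run_pipeline(target_fasta, jellyfish_db):
    """Run find_mutation and feed its output to find_report on stdin."""
    find_cmd, report_cmd = build_commands(target_fasta, jellyfish_db)
    found = subprocess.run(find_cmd, check=True,
                           capture_output=True, text=True)
    report = subprocess.run(report_cmd, input=found.stdout, check=True,
                            capture_output=True, text=True)
    return report.stdout


# Example call (requires km and the test data to be available):
# print(run_pipeline("./data/catalog/GRCh38/NPM1_4ins_exons_10-11utr.fa",
#                    "./data/jf/02H025_NPM1.jf"))
```

Passing the commands as lists avoids shell quoting issues with file names, and check=True raises an error if either step fails.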

min_cov:

This tool displays k-mer coverage statistics for a target sequence across a list of jellyfish databases.

$ km min_cov -h
$ km min_cov [your_fasta_targetSeq] [[your_jellyfish_count_table]...]
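As a rough illustration of what k-mer coverage of a target means, the Python sketch below looks up each k-mer of a toy target in a plain dictionary of counts (a stand-in for a jellyfish database) and summarises the result. The statistics chosen here (minimum, maximum, mean) and the names kmer_coverage and coverage_stats are assumptions for this example; the actual min_cov output may differ.

```python
from statistics import mean


def kmer_coverage(target, counts, k):
    """Return the count of each successive k-mer of the target (0 if unseen)."""
    return [counts.get(target[i:i + k], 0)
            for i in range(len(target) - k + 1)]


def coverage_stats(target, counts, k):
    """Summarise k-mer coverage along the target."""
    coverage = kmer_coverage(target, counts, k)
    return {"min": min(coverage), "max": max(coverage), "mean": mean(coverage)}


# Toy counts: the k-mer GTTG is missing, so the minimum coverage is 0.
toy_counts = {"ACGT": 10, "CGTT": 8, "TTGC": 6, "TGCA": 12}
stats = coverage_stats("ACGTTGCA", toy_counts, 4)
print(stats)
```

A minimum of 0 means at least one k-mer of the target is absent from the sample, so the reference path cannot be fully reconstructed from it.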

linear_kmin:

The k-mer length is a central parameter:

  • To produce a linear directed graph from the target sequence.
  • To avoid false positives. find_mutation shouldn't be used on jellyfish count tables built with k<21 bp (we recommend k=31 bp by default).

The linear_kmin tool is designed to give you the minimum k length that allows a target sequence to be decomposed into a linear graph.

$ km linear_kmin -h
$ km linear_kmin [your_catalog_directory]
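The notion of a linear decomposition can be sketched in a few lines of Python. This is one possible reading, not km's actual code: the target is considered linear for a given k when all its k-mers are distinct and the only overlaps of k-1 bases between them are those of consecutive k-mers. The names is_linear and linear_kmin (as a function) and the start parameter are assumptions for this example.

```python
def kmers(seq, k):
    """List the successive k-mers of a sequence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def is_linear(seq, k):
    """True if the k-mers of seq form a single unbranched path."""
    nodes = kmers(seq, k)
    node_set = set(nodes)
    if len(node_set) != len(nodes):
        return False  # a repeated k-mer creates a cycle
    # Count every edge (overlap of k-1 bases) between k-mers of the target.
    edges = sum((node[1:] + base) in node_set
                for node in nodes for base in "ACGT")
    return edges == len(nodes) - 1  # a simple path has exactly n-1 edges


def linear_kmin(seq, start=1):
    """Return the smallest k >= start giving a linear graph, or None."""
    for k in range(start, len(seq) + 1):
        if is_linear(seq, k):
            return k
    return None


toy_target = "ATCGATCA"
kmin = linear_kmin(toy_target)
print(kmin)
```

On this toy target, k=4 already gives distinct k-mers, but GATC can be followed by both ATCG and ATCA, so the graph still branches; k=5 is the first length that yields a single path.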

Running km on a real sample from downloaded fastq:

In the example folder you can find a script to help you to run a km analysis on one Leucegene sample.