most freq subseq

masikol edited this page Oct 18, 2022 · 6 revisions

most-freq-subseq

Description

This script finds the N most frequently occurring subsequences of a given length for each sequence in a FASTA file.
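
The counting idea can be sketched in a few lines of Python. This is only an illustration, not the script's actual implementation: the function name `top_subsequences` and its parameters are hypothetical, and it assumes that subsequences tied at the same frequency are all reported.

```python
from collections import Counter


def top_subsequences(sequence, length, num_top=1):
    """Return (subsequence, count) pairs whose counts are among the
    `num_top` highest distinct frequencies found in `sequence`.

    Hypothetical helper for illustration; overlapping windows are counted.
    """
    sequence = sequence.upper()
    # Slide a window of the requested length over the sequence.
    counts = Counter(
        sequence[i:i + length]
        for i in range(len(sequence) - length + 1)
    )
    # Keep the `num_top` highest distinct frequency values.
    top_freqs = sorted(set(counts.values()), reverse=True)[:num_top]
    # Report every subsequence that has one of those frequencies,
    # most frequent first, ties ordered alphabetically.
    return sorted(
        ((subseq, n) for subseq, n in counts.items() if n in top_freqs),
        key=lambda item: (-item[1], item[0]),
    )
```

For example, `top_subsequences("ACGTACGTAC", 2)` counts `AC` three times and `CG`, `GT`, `TA` twice each, so only `AC` is returned when `num_top` is 1.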

Input format: a *.fasta(.gz) or *.fa(.gz) file and the length of the query subsequence.
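
Reading such input, plain or gzipped, might look like the sketch below. The function `read_fasta` is a hypothetical helper, not part of the script, and it assumes that gzipped files can be recognised by their `.gz` extension.

```python
import gzip


def read_fasta(path):
    """Yield (header, sequence) pairs from a plain or gzipped FASTA file.

    Hypothetical helper for illustration only.
    """
    # Assumption: compression is detected from the file extension.
    opener = gzip.open if path.endswith(".gz") else open
    header, chunks = None, []
    with opener(path, "rt") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                # A new record starts; emit the previous one, if any.
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            else:
                chunks.append(line)
    # Emit the last record in the file.
    if header is not None:
        yield header, "".join(chunks)
```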

Dependencies

The script is written in Python, so you need a Python interpreter (version 3.X) to use it. Python can be downloaded from the official Python website (python.org).

Options

-h (--help): print help message;

-v (--version): print version;

-l (--query-length): length of subsequence to search;

-s (--subject): input FASTA file;

-N (--num-top-occur): number of top frequencies to show in the output.
   Default value is 1 (i.e. show only the most frequently occurring subsequences);

--both-strands: count subsequences on both strands
(default behaviour is to count only subsequences on the positive strand).
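
Counting on both strands can be pictured as counting windows in the sequence and in its reverse complement. The sketch below is an assumption about what the option means, not the script's actual code; `reverse_complement` and `count_both_strands` are hypothetical names, and only the bases A, C, G, T are complemented (other characters are left unchanged).

```python
from collections import Counter

# Translation table for the four canonical bases (assumption: other
# characters, such as N, are left as they are).
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(sequence):
    """Return the reverse complement of a DNA sequence."""
    return sequence.translate(_COMPLEMENT)[::-1]


def count_both_strands(sequence, length):
    """Count subsequences of the given length on the positive strand
    and on its reverse complement, summing both into one Counter."""
    counts = Counter()
    for strand in (sequence, reverse_complement(sequence)):
        counts.update(
            strand[i:i + length]
            for i in range(len(strand) - length + 1)
        )
    return counts
```

With this approach, `count_both_strands("AACG", 2)` sees `CG` once on each strand, so its count is 2, while `AA`, `AC`, `GT` and `TT` are each counted once.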

Usage

Find the 3 most abundant subsequences of length 16 for each sequence in the file my_favourite_genome.fasta.gz:

python3 most-freq-subseq.py -l 16 -s my_favourite_genome.fasta.gz -N 3