most freq subseq
masikol edited this page Oct 18, 2022 · 6 revisions
This script finds the N most frequently occurring subsequences of a given length for each sequence in a FASTA file.
Format of input: a *.fasta(.gz) or *.fa(.gz) file and the length of the query subsequence.
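As an illustration of the input format, here is a minimal sketch of reading a plain or gzipped FASTA file with the Python standard library. It is not taken from the script itself: the function names `open_fasta` and `read_fasta` are hypothetical, and choosing gzip by the `.gz` file extension is an assumption.

```python
import gzip


def open_fasta(path):
    """Open a FASTA file as text; assumes gzip compression if the name ends with .gz."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "rt")


def read_fasta(path):
    """Yield (header, sequence) pairs; multi-line sequences are joined."""
    header, chunks = None, []
    with open_fasta(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            if line.startswith(">"):
                # A new record starts: emit the previous one, if any
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            else:
                chunks.append(line)
    # Emit the last record
    if header is not None:
        yield header, "".join(chunks)
```

Each yielded pair corresponds to one sequence of the input file, which matches the per-sequence reporting described above.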
The script is written in Python, so you need a Python interpreter (version 3.X) to use it. Python can be downloaded from the official Python website.
-h (--help): print help message;
-v (--version): print version;
-l (--query-length): length of subsequence to search;
-s (--subject): input FASTA file;
-N (--num-top-occur): number of the highest frequencies to show in the output.
The default value is 1 (i.e. show only the most frequently occurring subsequences);
--both-strands: count subsequences on both strands
(by default, only subsequences on the positive strand are counted).
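To make the meaning of `-l`, `-N` and `--both-strands` concrete, here is a rough sketch of how such counting could work. It is not the script's actual implementation. The function names are hypothetical, and two points are assumptions: `-N` is read as "the N highest distinct counts, with ties included", and both-strand counting is modelled by also counting subsequences of the reverse complement.

```python
from collections import Counter

# Complement table for DNA; other characters (e.g. N) are left unchanged
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def top_subsequences(seq, length, num_top=1, both_strands=False):
    """Return (subsequence, count) pairs whose counts are among the num_top highest."""
    # Count every overlapping window of the given length on the positive strand
    counts = Counter(seq[i:i + length] for i in range(len(seq) - length + 1))
    if both_strands:
        # Assumption: the negative strand is the reverse complement of the input
        rc = reverse_complement(seq)
        counts.update(rc[i:i + length] for i in range(len(rc) - length + 1))
    # Keep the num_top highest distinct counts, so tied subsequences are all reported
    top_counts = set(sorted(set(counts.values()), reverse=True)[:num_top])
    return [(sub, n) for sub, n in counts.most_common() if n in top_counts]
```

For example, `top_subsequences("ACACAC", 2)` returns `[("AC", 3)]`, because `AC` occurs three times and `CA` only twice.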
Find the 3 most abundant subsequences of length 16 for each sequence in the file my_favourite_genome.fasta.gz:
python3 most-freq-subseq.py -l 16 -s my_favourite_genome.fasta.gz -N 3