Calculate various statistics from a long read sequencing dataset in fastq, fasta, bam or albacore sequencing summary format.
pip install nanostat
or
conda install -c bioconda nanostat
NanoStat [-h] [-v] [-o OUTDIR] [-p PREFIX] [-n NAME] [-t N]
[--barcoded] [--readtype {1D,2D,1D2}]
(--fastq file [file ...] | --fasta file [file ...] | --summary file [file ...] | --bam file [file ...])
Calculate statistics of long read sequencing dataset.
General options:
-h, --help show the help and exit
-v, --version Print version and exit.
-o, --outdir OUTDIR Specify directory in which output has to be created.
-p, --prefix PREFIX Specify an optional prefix to be used for the output file.
-n, --name NAME Specify a filename/path for the output; stdout is the default.
-t, --threads N Set the allowed number of threads to be used by the script.
Input options:
--barcoded Use if you want to split the summary file by barcode
--readtype {1D,2D,1D2}
Which read type to extract information about from summary. Options are 1D, 2D, 1D2.
Input data sources, one of these is required:
--fastq file [file ...]
Data is in one or more (compressed) fastq file(s).
--fasta file [file ...]
Data is in one or more (compressed) fasta file(s).
--summary file [file ...]
Data is in one or more (compressed) summary file(s) generated by albacore.
--bam file [file ...]
Data is in one or more sorted bam file(s).
EXAMPLES:
NanoStat --fastq reads.fastq.gz --outdir statreports
NanoStat --summary sequencing_summary1.txt sequencing_summary2.txt sequencing_summary3.txt --readtype 1D2
NanoStat --bam alignment.bam alignment2.bam
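For scripted use, the same invocations can be assembled from Python. This is a minimal sketch and not part of NanoStat: the helper names `build_nanostat_command` and `run_nanostat` are hypothetical, and only flags listed in the usage text above are used. Running the command assumes that `NanoStat` is installed and on the PATH, and relies on the documented default of writing the report to stdout when `--name` is not given.

```python
import subprocess

# Input flags taken from the usage text; exactly one of them is required.
INPUT_FLAGS = ("fastq", "fasta", "summary", "bam")


def build_nanostat_command(source, files, outdir=None, name=None, readtype=None):
    """Return the argument list for a NanoStat call (hypothetical helper)."""
    if source not in INPUT_FLAGS:
        raise ValueError(f"source must be one of {INPUT_FLAGS}, got {source!r}")
    if not files:
        raise ValueError("at least one input file is required")
    cmd = ["NanoStat", f"--{source}", *files]
    if outdir is not None:
        cmd += ["--outdir", outdir]
    if name is not None:
        cmd += ["--name", name]
    if readtype is not None:
        cmd += ["--readtype", readtype]
    return cmd


def run_nanostat(*args, **kwargs):
    """Run NanoStat and return its stdout; requires NanoStat to be installed."""
    cmd = build_nanostat_command(*args, **kwargs)
    return subprocess.run(cmd, check=True, capture_output=True, text=True).stdout
```

Building the argument list separately from running it keeps the command easy to inspect or log before anything is executed.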
General summary:
Number of reads: 3995
Total bases: 11418359
Median read length: 1221.0
Mean read length: 2858.2
Read length N50: 8676
Active channels: 933
Mean read quality: 10.2
Median read quality: 10.6
Top 5 longest reads and their mean basecall quality score
1: 36928 (10.8, [a9dbd2b5-718c-4d0c-afa8-a12a54a5a12a])
2: 32830 (10.2, [b87fc717-1cf8-4526-9f96-3042fda5b769])
3: 30474 (12.4, [ea3e43d8-6cbf-4687-95bd-66e6123512d4])
4: 27531 (12.5, [74c0e08c-eb94-4825-b93b-21d63e05cf14])
5: 26535 (10.4, [8e6ed505-8477-4462-9f0a-3a72783cbf60])
Top 5 highest mean basecall quality scores and their read lengths
1: 14.8 (1040, [acf6f90b-ea22-4960-8049-6e6e694a3f9a])
2: 14.7 (9603, [ec796da1-5c4a-4350-974b-6dabb8deb546])
3: 14.6 (680, [792c485a-81cb-4ef7-8f23-01f10f9c7c23])
4: 14.5 (2664, [d8092ffb-9919-42fb-ad41-34b1658f1bd5])
5: 14.5 (909, [d55d3bf6-0729-4b46-82cd-0cef00bcf849])
Number and percentage of reads above quality cutoffs
>Q5: 3559 (89.1%)
>Q7: 3429 (85.8%)
>Q10: 2705 (67.7%)
>Q12: 1072 (26.8%)
>Q15: 0 (0.0%)
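The length and quality figures in this report follow common definitions. As an illustration only, and not NanoStat's actual implementation, the sketch below computes a read length N50 and the number of reads above each quality cutoff for a small made-up dataset. The function names and input values are hypothetical. N50 is taken here as the length of the shortest read among the longest reads that together hold at least half of all bases, and "above" a cutoff is assumed to mean strictly greater than it.

```python
def read_length_n50(lengths):
    """Length L such that reads of length >= L hold at least half of all bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise ValueError("no reads given")


def reads_above_cutoffs(qualities, cutoffs=(5, 7, 10, 12, 15)):
    """Map each cutoff to (count, percentage) of reads with quality above it."""
    n = len(qualities)
    if n == 0:
        raise ValueError("no reads given")
    result = {}
    for cutoff in cutoffs:
        count = sum(1 for q in qualities if q > cutoff)
        result[cutoff] = (count, round(100 * count / n, 1))
    return result


# Made-up example data, not taken from any real run.
lengths = [100, 200, 300, 400, 500]
qualities = [4.0, 8.5, 10.6, 12.5, 14.8]

print("Read length N50:", read_length_n50(lengths))
for cutoff, (count, pct) in reads_above_cutoffs(qualities).items():
    print(f">Q{cutoff}: {count} ({pct}%)")
```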
I welcome all suggestions, bug reports, feature requests and contributions. Please leave an issue or open a pull request. I will usually respond within a day, or rarely within a few days.