calc_stats.py is a command-line tool for computing nucleotide composition statistics from mitochondrial or nuclear genomes.
It supports both FASTA and GenBank input, and can process single files or entire directories recursively.
- Counts A, C, G, T, ambiguous IUPAC bases, and N.
- Calculates:
  - %A, %C, %G, %T, %N, %Ambiguities
  - %AT, %GC
  - AT-skew = (A−T)/(A+T)
  - GC-skew = (G−C)/(G+C)
- Option to exclude ambiguities from the denominator (--skip-ambig).
- Supports FASTA (.fa, .fasta, .fna, .ffn) and GenBank (.gb, .gbk, .genbank).
- Recursive directory search for batch processing.
- Optional GC% sliding window profiles (--win, --step), saved as per-sequence CSV files.
- Output as a tab-delimited (TSV) table, easy to import into R, Python, or Excel.
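The composition and skew formulas listed above can be sketched in a few lines of plain Python. This is a minimal illustration, not the script's actual code: the function name `composition_stats` is hypothetical, and the assumption that `--skip-ambig` removes both N and the other IUPAC ambiguity codes from the denominator is mine.

```python
from collections import Counter

# IUPAC ambiguity codes other than N (N is counted separately).
IUPAC_AMBIGUOUS = "RYSWKMBDHV"


def composition_stats(seq, skip_ambig=False):
    """Return counts, percentages and skews for one nucleotide sequence."""
    seq = seq.upper()
    counts = Counter(seq)
    a, c, g, t, n = (counts[b] for b in "ACGTN")
    ambiguous = sum(counts[b] for b in IUPAC_AMBIGUOUS)
    # Assumption: skip_ambig keeps only A+C+G+T in the denominator.
    denom = (a + c + g + t) if skip_ambig else len(seq)

    def pct(x):
        return 100.0 * x / denom if denom else 0.0

    def skew(x, y):
        # (x - y) / (x + y), defined as 0.0 when both counts are zero.
        return (x - y) / (x + y) if (x + y) else 0.0

    return {
        "length": len(seq),
        "A": a, "C": c, "G": g, "T": t, "N": n, "ambiguous": ambiguous,
        "pct_A": pct(a), "pct_C": pct(c), "pct_G": pct(g), "pct_T": pct(t),
        "pct_N": pct(n), "pct_ambiguous": pct(ambiguous),
        "pct_AT": pct(a + t), "pct_GC": pct(g + c),
        "AT_skew": skew(a, t), "GC_skew": skew(g, c),
    }
```

For example, `composition_stats("AAATGGCN")` gives an AT-skew of (3−1)/(3+1) = 0.5 and a GC content of 37.5% when ambiguities stay in the denominator.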
Requires Python 3.8+ and Biopython:
pip install biopython
python calc_stats.py genome.fa --out-tsv results/stats.tsv --skip-ambig
python calc_stats.py data/ --out-tsv results/all_stats.tsv --skip-ambig
python calc_stats.py data/ --out-tsv results/stats.tsv --skip-ambig --win 200 --step 50
The last command creates additional files like seq1_gc_windows.csv with local GC% profiles.
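A sliding-window GC% profile of the kind produced with --win and --step could look roughly like the sketch below. The function names, the CSV column names, the choice to emit only full windows, and the choice to ignore non-ACGT characters inside a window are all assumptions for illustration; the script's real output layout may differ.

```python
import csv


def gc_windows(seq, win=200, step=50):
    """Yield (start, end, gc_percent) for each full window (0-based, end-exclusive)."""
    seq = seq.upper()
    for start in range(0, max(len(seq) - win + 1, 0), step):
        window = seq[start:start + win]
        gc = window.count("G") + window.count("C")
        acgt = gc + window.count("A") + window.count("T")
        # Assumption: ambiguous bases are left out of the window denominator.
        yield start, start + win, (100.0 * gc / acgt if acgt else 0.0)


def write_windows_csv(seq, handle, win=200, step=50):
    """Write one CSV row per window to an open text handle."""
    writer = csv.writer(handle)
    writer.writerow(["start", "end", "pct_GC"])  # hypothetical column names
    for start, end, gc in gc_windows(seq, win, step):
        writer.writerow([start, end, f"{gc:.2f}"])
```

A sequence shorter than the window yields no rows, which keeps short records from producing misleading partial-window values.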
- input_path: source file
- record_id: sequence identifier
- name: record id + description
- length: sequence length
- A, C, G, T, N, ambiguous: counts
- pct_A, pct_C, pct_G, pct_T, pct_N, pct_ambiguous: percentages
- pct_AT, pct_GC: AT/GC content
- AT_skew, GC_skew: skew indices
input_path record_id name length A C G T N ambiguous pct_A pct_C pct_G pct_T pct_N pct_ambiguous pct_AT pct_GC AT_skew GC_skew
genome.fa seq1 seq1 16569 5100 2800 2700 4969 0 0 30.8 16.9 16.3 30.0 0.0 0.0 60.8 33.2 0.013 -0.018
If you use this script in research, please cite this repository and acknowledge Biopython.
extract_dloop.py extracts the mitochondrial control region (D-loop) from GenBank or FASTA+GFF3 inputs.
It works on a single file or recursively over directories, supports circular genomes, and provides robust detection heuristics.
- Robust detection of D-loop: matches D_loop, D-loop, control_region, or misc_feature with note mentioning control/D-loop.
- FASTA + GFF3 support (auto-pair by basename via --gff-dir, or specify with --gff).
- Heuristic fallback: extracts the intergenic region between tRNA-Pro and tRNA-Phe when no explicit D-loop feature exists.
- Circular-aware extraction: --circular allows intervals that wrap around the origin.
- Policies: --fail-policy skip|empty|error and --multi best|keep for multiple candidates.
- QC filters: --min-len / --max-len.
- Outputs: per-record FASTA, optional combined FASTA, optional BED, and JSON sidecars (coords, strand, length, source).
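The circular-aware extraction and the tRNA-Pro/tRNA-Phe fallback described in the list above can be illustrated with plain string slicing. This is only a sketch under stated assumptions: both function names are hypothetical, coordinates are 0-based and end-exclusive, strands are ignored, and the control region is assumed to lie after tRNA-Pro and before tRNA-Phe in coordinate order (the typical vertebrate arrangement).

```python
def extract_interval(seq, start, end, circular=False):
    """Return seq[start:end]; when start > end, wrap around the origin."""
    if start <= end:
        return seq[start:end]
    if not circular:
        raise ValueError("interval wraps around the origin; enable circular mode")
    # Tail of the record followed by its head.
    return seq[start:] + seq[:end]


def dloop_between(seq, trna_pro, trna_phe, circular=False):
    """Heuristic: region from the end of tRNA-Pro to the start of tRNA-Phe.

    trna_pro and trna_phe are (start, end) tuples, 0-based, end-exclusive.
    """
    return extract_interval(seq, trna_pro[1], trna_phe[0], circular)
```

When tRNA-Phe starts at position 0, the wrap-around branch simply returns the tail of the record after tRNA-Pro.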
- Python 3.8+
- Biopython
pip install biopython
python extract_dloop.py genome.gb --out-dir results/dloop --circular
python extract_dloop.py data/ --out-dir results/dloop --circular
python extract_dloop.py sample.fasta --gff sample.gff3 --out-dir results/dloop
python extract_dloop.py genomes/ --gff-dir annotations/ --out-dir results/dloop --circular
python extract_dloop.py genomes/ --out-dir out/dloop --combine-out out/dloop_all.fasta --bed out/dloop.bed --circular
- inputs: one or more files or directories (recursive).
- --out-dir: directory for per-record FASTA/JSON outputs (required).
- --combine-out: optional path to a combined FASTA with all extracted D-loops.
- --gff: path to a single GFF3 file (for a single FASTA).
- --gff-dir: directory used to auto-match *.gff / *.gff3 by basename for FASTAs.
- --circular: treat sequences as circular; allows wrap-around extraction.
- --fail-policy {skip,empty,error}: behavior when D-loop is not found (default: skip).
- --multi {best,keep}: keep all candidates (keep) or only the best (best, default).
- --min-len / --max-len: filter sequences by length (0 disables).
- --bed: optional BED file to append intervals (wrap-around not split).
- --log-level: logging level (INFO, DEBUG, etc.).
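Pairing a FASTA with its annotation by basename, as --gff-dir does, might be approximated as follows. The helper name `find_gff` and the preference for .gff3 over .gff are assumptions made for this sketch, not documented behavior.

```python
from pathlib import Path


def find_gff(fasta_path, gff_dir):
    """Return the GFF file in gff_dir sharing the FASTA's basename, or None."""
    stem = Path(fasta_path).stem  # "genomes/sample.fasta" -> "sample"
    for ext in (".gff3", ".gff"):  # assumed search order
        candidate = Path(gff_dir) / (stem + ext)
        if candidate.is_file():
            return candidate
    return None
```

With this layout, genomes/sample.fasta would be matched to annotations/sample.gff3 if that file exists.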
- out_dir/<record_id>_dloop.fasta (or _dloop_heur.fasta when the heuristic is used)
- out_dir/<record_id>_dloop.json with metadata
- --combine-out → single FASTA with all extracted D-loops
- --bed → BED file with intervals
FASTA header example:
>NC_XXXX | dloop | coords=15432..16569 | strand=+ | source=genome.gb
AACCTTG...
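Headers in this pipe-separated style are easy to parse downstream. The sketch below assumes the layout shown in the example (record ID, then a label, then key=value fields); `parse_header` is a hypothetical helper and is not part of the script.

```python
def parse_header(header):
    """Split a '>id | label | key=value | ...' FASTA header into a dict."""
    parts = [p.strip() for p in header.lstrip(">").split("|")]
    info = {
        "record_id": parts[0],
        "label": parts[1] if len(parts) > 1 else "",
    }
    for field in parts[2:]:
        if "=" in field:
            key, value = field.split("=", 1)
            info[key.strip()] = value.strip()
    # Turn "15432..16569" into a pair of integers when coords are present.
    if ".." in info.get("coords", ""):
        start, end = info["coords"].split("..", 1)
        info["coords"] = (int(start), int(end))
    return info
```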
JSON sidecar example:
{
"record_id": "NC_XXXX",
"source": "genome.gb",
"coords": [15432, 16569],
"strand": "+",
"length": 1138,
"rank": 1,
"type": "control_region",
"note": "putative control region"
}
- For FASTA without annotations, use --gff / --gff-dir or rely on the Pro↔Phe heuristic.
- BED cannot represent wrap-around intervals in a single line; the script writes the raw [start, end) as-is.
- For production-grade GFF handling (phase, attributes), consider gffutils. The current parser is minimal and focused on D-loop/tRNA features.
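Because the BED output keeps wrap-around intervals as a raw [start, end) pair, downstream tools may need them split at the origin. One possible post-processing step is sketched below; the function name and the six-column row layout are assumptions, and the script itself does not perform this split.

```python
def split_wraparound(chrom, start, end, length, name=".", strand="+"):
    """Return BED-style rows covering [start, end) on a circular record.

    A normal interval gives one row; a wrap-around interval (start > end)
    gives two rows, one up to the record end and one from the origin.
    """
    if start <= end:
        rows = [(chrom, start, end, name, 0, strand)]
    else:
        rows = [
            (chrom, start, length, name, 0, strand),
            (chrom, 0, end, name, 0, strand),
        ]
    # Drop zero-length pieces, e.g. when the interval ends exactly at the origin.
    return [row for row in rows if row[1] < row[2]]
```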
If this tool contributes to your research, please cite this repository and acknowledge Biopython.
extract_pcgs.py extracts mitochondrial protein-coding genes (PCGs) from GenBank or FASTA+GFF3 inputs.
It works on a single file or recursively over directories, and can emit DNA or translated proteins.
It also supports concatenating all PCGs per record.
- CDS-aware extraction from GenBank (features of type CDS) or GFF3 (entries with type=CDS).
- Groups multiple CDS by Parent/ID (GFF3) or by gene/locus_tag (GenBank).
- Per-gene FASTA output (default) or a single concatenated sequence (--mode concat) in genomic order.
- Optional protein translation (--protein) with configurable NCBI translation table (--transl-table, default 2: Vertebrate Mitochondrial).
- Circular-aware extraction (wrap-around exons).
- Optional combined FASTA, BED with exon intervals, and JSON sidecars with metadata.
- Filters by length (--min-len, --max-len).
- Python 3.8+
- Biopython
pip install biopython
python extract_pcgs.py genome.gb --out-dir results/pcgs --protein --transl-table 2
python extract_pcgs.py data/ --out-dir results/pcgs
python extract_pcgs.py sample.fasta --gff sample.gff3 --out-dir results/pcgs --mode per-gene
python extract_pcgs.py genomes/ --gff-dir annotations/ --out-dir results/pcgs --mode concat --protein
python extract_pcgs.py genomes/ --out-dir out/pcgs --combine-out out/pcgs_all.fasta --bed out/pcgs_exons.bed --protein --transl-table 2
- inputs: one or more files or directories (recursive).
- --out-dir: directory for per-gene FASTA/JSON (required).
- --combine-out: optional path to a combined FASTA.
- --gff: path to a single GFF3 file (for a single FASTA).
- --gff-dir: directory used to auto-match *.gff / *.gff3 by basename.
- --circular: treat sequences as circular (wrap-around exons).
- --protein: output proteins (AA) instead of DNA.
- --transl-table: NCBI translation table (default 2 = Vertebrate Mitochondrial).
- --mode {per-gene,concat}: emit per gene (default) or concatenated PCGs per record.
- --min-len / --max-len: filter sequences by length (0 disables).
- --bed: optional BED file with exon intervals.
- --log-level: logging level.
- out_dir/<record_id>_<gene_id>.fna (DNA) or .faa (protein).
- out_dir/<record_id>_<gene_id>.json metadata (coords, strand, exon list, product).
- --mode concat → <record_id>_PCGs_concat.fna/.faa + .json.
- --combine-out → combined FASTA with all outputs.
- --bed → BED file with exon intervals (0-based, end-exclusive).
FASTA header example:
>NC_XXXX | ND2 | product=NADH dehydrogenase subunit 2 | exons=1 | source=genome.gb
ATG...
JSON sidecar example:
{
"record_id": "NC_XXXX",
"source": "genome.gb",
"gene_id": "ND2",
"product": "NADH dehydrogenase subunit 2",
"protein": false,
"transl_table": 2,
"length": 1047,
"exons": [[4586, 5633, "+"]]
}
- The GFF3 parser is minimal, focused on CDS records. For complex eukaryotic models, consider gffutils.
- Translation stops at the first stop codon (to_stop=True). For quality control, inspect frames and partials.
- Mitogenomes typically have single-exon CDS, but the code supports multi-exon models and strand handling.
- For concatenation, genes are ordered by the start of their first exon.
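The multi-exon, strand-aware joining and the concatenation order mentioned in these notes can be sketched without any dependencies. The function names are hypothetical, exons are assumed to be (start, end, strand) tuples in 0-based, end-exclusive coordinates as in the JSON sidecar example, "first exon" is taken to mean the exon with the smallest start, and wrap-around exons are not handled here.

```python
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def revcomp(seq):
    """Reverse-complement a DNA string (A/C/G/T/N only)."""
    return seq.translate(_COMPLEMENT)[::-1]


def extract_cds(genome, exons):
    """Join the exons of one gene and orient the result by strand."""
    strand = exons[0][2]
    ordered = sorted(exons, key=lambda e: e[0])
    dna = "".join(genome[start:end] for start, end, _ in ordered)
    # For minus-strand genes, reverse-complement the joined sequence once.
    return revcomp(dna) if strand == "-" else dna


def concat_pcgs(genome, genes):
    """Concatenate genes ordered by the start of their first exon.

    genes maps gene_id -> list of (start, end, strand) exons.
    Returns (ordered gene IDs, concatenated sequence).
    """
    order = sorted(genes, key=lambda g: min(e[0] for e in genes[g]))
    return order, "".join(extract_cds(genome, genes[g]) for g in order)
```

Joining minus-strand exons in ascending genomic order and reverse-complementing once gives the same result as reverse-complementing each exon and joining them in descending order.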
If this tool contributes to your research, please cite this repository and acknowledge Biopython.
extract_rrna.py extracts mitochondrial rRNAs (12S and 16S) from GenBank or FASTA+GFF3 inputs.
It works on a single file or recursively over directories, includes robust matching logic, and supports circular genomes.
- Detects rRNAs from GenBank (feature.type == 'rRNA') or GFF3 (type=rRNA).
- Robust matching by product/note/gene/Name/ID containing 12S or 16S (also accepts common synonyms such as ssu/lsu, rrnS/rrnL).
- Circular-aware extraction (wrap-around intervals).
- Fail policies: --fail-policy skip|empty|error.
- Optional fallback by FASTA header (--fasta-header-fallback) when no features exist.
- QC filters: --min-len / --max-len.
- Outputs: per-rRNA FASTA, optional combined FASTA, optional BED, and JSON sidecars with metadata.
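The 12S/16S matching described in the list above could be approximated with two regular expressions. The patterns below cover only the synonyms named in this README (ssu/lsu, rrnS/rrnL); the function name and the plain-dict input are assumptions, and the script's actual rules may be broader.

```python
import re

# One case-insensitive pattern per label, limited to the synonyms listed above.
_RRNA_PATTERNS = {
    "12S": re.compile(r"12S|\bssu\b|\brrnS\b", re.IGNORECASE),
    "16S": re.compile(r"16S|\blsu\b|\brrnL\b", re.IGNORECASE),
}

_FIELDS = ("product", "note", "gene", "Name", "ID")


def classify_rrna(qualifiers):
    """Return '12S', '16S' or None from a dict of annotation strings.

    Assumption: qualifiers maps field names to plain strings; Biopython
    stores qualifier values as lists, which would need joining first.
    """
    text = " ".join(str(qualifiers.get(key, "")) for key in _FIELDS)
    for label, pattern in _RRNA_PATTERNS.items():
        if pattern.search(text):
            return label
    return None
```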
- Python 3.8+
- Biopython
pip install biopython
python extract_rrna.py genome.gb --out-dir results/rrna
python extract_rrna.py data/ --out-dir results/rrna --circular
python extract_rrna.py sample.fasta --gff sample.gff3 --out-dir results/rrna
python extract_rrna.py genomes/ --gff-dir annotations/ --out-dir results/rrna --circular
python extract_rrna.py genomes/ --out-dir out/rrna --combine-out out/rrna_all.fasta --bed out/rrna_regions.bed --circular
- inputs: one or more files or directories (recursive).
- --out-dir: directory for per-record FASTA/JSON outputs (required).
- --combine-out: optional path to a combined FASTA.
- --gff: path to a single GFF3 file (for a single FASTA).
- --gff-dir: directory used to auto-match *.gff / *.gff3 by basename.
- --circular: treat sequences as circular; allows wrap-around extraction.
- --only {12S,16S}: restrict extraction to one rRNA.
- --min-len / --max-len: filter sequences by length (0 disables).
- --bed: optional BED file for intervals.
- --fasta-header-fallback: if no features, try to detect 12S/16S from FASTA headers (use with caution).
- --fail-policy {skip,empty,error}: behavior when targets are missing.
- --log-level: logging level.
- out_dir/<record_id>_12S.fna (or _16S.fna), per record.
- out_dir/<record_id>_<tag>.json sidecar with metadata (coords, strand, length, label, notes).
- --combine-out → combined FASTA.
- --bed → BED file with intervals (0-based, end-exclusive).
FASTA header example:
>NC_XXXX | 12S | coords=1481..2463 | strand=+ | source=genome.gb
ATG...
JSON sidecar example:
{
"record_id": "NC_XXXX",
"source": "genome.gb",
"coords": [1481, 2463],
"strand": "+",
"length": 983,
"label": "12S",
"type": "rRNA",
"note": "12S ribosomal RNA"
}
- The GFF3 parser is minimal, focused on rRNA records and common attributes.
- FASTA header fallback is best-effort; prefer annotated inputs (GenBank or GFF3).
- BED cannot represent wrap-around intervals in a single line; the script writes the raw [start, end) as-is.
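To give a sense of what a minimal GFF3 parser of this kind involves, the sketch below reads one feature line into a dict. It is an illustration only: `parse_gff3_line` is a hypothetical name, the conversion from GFF3's 1-based inclusive coordinates to 0-based, end-exclusive ones is my choice to match the BED convention, and multi-valued attributes are not split.

```python
from urllib.parse import unquote


def parse_gff3_line(line):
    """Parse one GFF3 feature line; return None for comments or malformed rows."""
    if not line.strip() or line.startswith("#"):
        return None
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        return None
    seqid, _source, ftype, start, end, _score, strand, _phase, attrs = cols
    attributes = {}
    for field in attrs.split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            # GFF3 percent-encodes reserved characters in attribute values.
            attributes[key.strip()] = unquote(value.strip())
    return {
        "seqid": seqid,
        "type": ftype,
        "start": int(start) - 1,  # GFF3 is 1-based inclusive
        "end": int(end),
        "strand": strand,
        "attributes": attributes,
    }
```

A caller would then keep only rows whose type is rRNA (or tRNA, or CDS) and pass the attributes to the matching logic.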
If this tool contributes to your research, please cite this repository and acknowledge Biopython.
extract_trna.py extracts mitochondrial tRNAs from GenBank or FASTA+GFF3 inputs.
It works on a single file or recursively over directories, labels each tRNA (e.g., tRNA-Phe/trnF), and is aware of circular genomes.
- Detects tRNAs from GenBank (feature.type == 'tRNA') or GFF3 (type=tRNA).
- Robust labeling from product/note/gene/Name/ID (supports formats tRNA-Phe, trnF, (Phe)).
- Circular-aware extraction (wrap-around intervals).
- Fail policies: --fail-policy skip|empty|error.
- QC filters: --min-len / --max-len.
- Outputs: per-tRNA FASTA, optional combined FASTA, optional BED, and JSON sidecars with metadata.
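Normalizing the three label formats named above (tRNA-Phe, trnF, (Phe)) to a single form might look like the following. This is a hedged sketch: `trna_label` is a hypothetical helper, only the 20 standard amino acids are covered, and isotype suffixes such as the 2 in trnL2 are ignored.

```python
import re

ONE_TO_THREE = {
    "A": "Ala", "R": "Arg", "N": "Asn", "D": "Asp", "C": "Cys",
    "Q": "Gln", "E": "Glu", "G": "Gly", "H": "His", "I": "Ile",
    "L": "Leu", "K": "Lys", "M": "Met", "F": "Phe", "P": "Pro",
    "S": "Ser", "T": "Thr", "W": "Trp", "Y": "Tyr", "V": "Val",
}
THREE_LETTER = set(ONE_TO_THREE.values())


def trna_label(text):
    """Return a 'tRNA-Xxx' label inferred from an annotation string, or None."""
    # Formats "tRNA-Phe" and "(Phe)".
    match = re.search(r"tRNA[-_ ]?([A-Za-z]{3})\b|\(([A-Za-z]{3})\)", text)
    if match:
        aa = (match.group(1) or match.group(2)).capitalize()
        if aa in THREE_LETTER:
            return f"tRNA-{aa}"
    # Format "trnF": one-letter amino acid code after "trn".
    match = re.search(r"\btrn([A-Z])", text)
    if match and match.group(1) in ONE_TO_THREE:
        return f"tRNA-{ONE_TO_THREE[match.group(1)]}"
    return None
```

Mapping everything to the tRNA-Xxx form keeps output file names consistent across differently annotated inputs.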
- Python 3.8+
- Biopython
pip install biopython
python extract_trna.py genome.gb --out-dir results/trna
python extract_trna.py data/ --out-dir results/trna --circular
python extract_trna.py sample.fasta --gff sample.gff3 --out-dir results/trna
python extract_trna.py genomes/ --gff-dir annotations/ --out-dir results/trna --circular
python extract_trna.py genomes/ --out-dir out/trna --combine-out out/trna_all.fasta --bed out/trna_regions.bed --circular
- inputs: one or more files or directories (recursive).
- --out-dir: directory for per-record FASTA/JSON outputs (required).
- --combine-out: optional path to a combined FASTA.
- --gff: path to a single GFF3 file (for a single FASTA).
- --gff-dir: directory used to auto-match *.gff / *.gff3 by basename.
- --circular: treat sequences as circular; allows wrap-around extraction.
- --only: restrict extraction to a specific tRNA label (e.g., Phe, Pro, tRNA-Phe, trnF).
- --min-len / --max-len: filter sequences by length (0 disables).
- --bed: optional BED file for intervals.
- --fail-policy {skip,empty,error}: behavior when nothing is found.
- --log-level: logging level.
- out_dir/<record_id>_tRNA-XXX.fna (one file per tRNA).
- out_dir/<record_id>_tRNA-XXX.json sidecar with metadata (coords, strand, label).
- --combine-out → combined FASTA.
- --bed → BED file with intervals (0-based, end-exclusive).
FASTA header example:
>NC_XXXX | tRNA-Phe | coords=521..589 | strand=+ | source=genome.gb
JSON sidecar example:
{
"record_id": "NC_XXXX",
"source": "genome.gb",
"coords": [521, 589],
"strand": "+",
"length": 69,
"label": "tRNA-Phe",
"type": "tRNA",
"note": "tRNA-Phe (trnF)"
}
- The GFF3 parser is minimal, focused on tRNA records and common attributes.
- Label inference uses heuristics; verify naming conventions in heterogeneous annotations.
- BED cannot represent wrap-around intervals in a single line; the script writes raw [start,end).
If this tool contributes to your research, please cite this repository and acknowledge Biopython.