mitoTools

calc_stats.py

calc_stats.py is a command-line tool for computing nucleotide composition statistics from mitochondrial or nuclear genomes.
It supports both FASTA and GenBank input, and can process single files or entire directories recursively.

Features

Counts A, C, G, T, ambiguous IUPAC bases, and N.
Calculates:
- %A, %C, %G, %T, %N, %Ambiguities
- %AT, %GC
- AT-skew = (A−T)/(A+T)
- GC-skew = (G−C)/(G+C)
Option to exclude ambiguous bases from the percentage denominator (--skip-ambig).
Supports FASTA (.fa, .fasta, .fna, .ffn) and GenBank (.gb, .gbk, .genbank).
Recursive directory search for batch processing.
Optional GC% sliding window profiles (--win, --step), saved as per-sequence CSV files.
Output is a tab-delimited (TSV) table that is easy to import into R, Python, or Excel.
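The statistics listed above can be sketched in a few lines of plain Python. This is an illustration only, not the script's actual code: the function name composition, the choice of sequence length as the default denominator, and the choice of A+C+G+T as the denominator under --skip-ambig are assumptions.

```python
from collections import Counter

# IUPAC ambiguity codes other than N, which is tallied separately
AMBIGUITY_CODES = "RYSWKMBDHV"


def composition(seq, skip_ambig=False):
    """Return base counts, percentages, and skews for one sequence (hypothetical helper)."""
    counts = Counter(seq.upper())
    a, c, g, t, n = (counts[base] for base in "ACGTN")
    ambiguous = sum(counts[code] for code in AMBIGUITY_CODES)
    # Assumption: with skip_ambig only unambiguous bases form the denominator
    denom = (a + c + g + t) if skip_ambig else len(seq)

    def pct(x):
        return 100.0 * x / denom if denom else 0.0

    def skew(x, y):
        # (x - y) / (x + y), defined as 0 when both counts are zero
        return (x - y) / (x + y) if (x + y) else 0.0

    return {
        "length": len(seq),
        "A": a, "C": c, "G": g, "T": t, "N": n, "ambiguous": ambiguous,
        "pct_A": pct(a), "pct_C": pct(c), "pct_G": pct(g), "pct_T": pct(t),
        "pct_AT": pct(a + t), "pct_GC": pct(g + c),
        "AT_skew": skew(a, t), "GC_skew": skew(g, c),
    }
```

As a check of the skew formulas, counts of A=5100 and T=4969 give an AT-skew of 131/10069, about 0.013, and G=2700 with C=2800 give a GC-skew of -100/5500, about -0.018.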

Installation

Requires Python 3.8+ and Biopython:

pip install biopython

Usage

Single FASTA file

python calc_stats.py genome.fa --out-tsv results/stats.tsv --skip-ambig

Directory with multiple FASTAs (recursive)

python calc_stats.py data/ --out-tsv results/all_stats.tsv --skip-ambig
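Recursive discovery of input files can be pictured roughly as follows. The function name find_inputs is hypothetical, and case-insensitive extension matching is an assumption; only the extension lists come from the Features section.

```python
from pathlib import Path

FASTA_EXT = {".fa", ".fasta", ".fna", ".ffn"}
GENBANK_EXT = {".gb", ".gbk", ".genbank"}
SUPPORTED = FASTA_EXT | GENBANK_EXT


def find_inputs(path):
    """Return a file as-is, or every supported file below a directory (hypothetical helper)."""
    p = Path(path)
    if p.is_file():
        return [p]
    # rglob("*") walks the directory tree recursively
    return sorted(
        f for f in p.rglob("*")
        if f.is_file() and f.suffix.lower() in SUPPORTED
    )
```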

With GC sliding windows

python calc_stats.py data/ --out-tsv results/stats.tsv --skip-ambig --win 200 --step 50

This creates additional files like seq1_gc_windows.csv with local GC% profiles.
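A sliding-window GC% profile can be sketched as below. This is a minimal illustration under stated assumptions, not the script's implementation: only full-length windows are emitted, the denominator is the window size, the sequence is treated as linear, and the function name gc_windows is hypothetical.

```python
def gc_windows(seq, win=200, step=50):
    """Return (start, end, gc_percent) tuples for full windows (hypothetical helper)."""
    seq = seq.upper()
    rows = []
    # Stop at the last position where a complete window still fits
    for start in range(0, len(seq) - win + 1, step):
        window = seq[start:start + win]
        gc = window.count("G") + window.count("C")
        rows.append((start, start + win, 100.0 * gc / win))
    return rows
```

For example, gc_windows("GGCCAATT", win=4, step=2) yields three windows with GC% of 100.0, 50.0, and 0.0.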

Output columns (TSV)

input_path: source file
record_id: sequence identifier
name: record id + description
length: sequence length
A, C, G, T, N, ambiguous: counts
pct_A, pct_C, pct_G, pct_T, pct_N, pct_ambiguous: percentages
pct_AT, pct_GC: AT/GC content
AT_skew, GC_skew: skew indices

Example output

input_path   record_id   name   length   A   C   G   T   N   ambiguous   pct_A   pct_C   pct_G   pct_T   pct_N   pct_ambiguous   pct_AT   pct_GC   AT_skew   GC_skew
genome.fa    seq1        seq1   16569    5100 2800 2700 4969 0   0   30.8   16.9   16.3   30.0   0.0   0.0   60.8   33.2   0.013   -0.018

Citation

If you use this script in research, please cite this repository and acknowledge Biopython.

extract_dloop.py

extract_dloop.py extracts the mitochondrial control region (D-loop) from GenBank or FASTA+GFF3 inputs.
It works on a single file or recursively over directories, supports circular genomes, and falls back on a heuristic when no explicit D-loop feature is annotated.

Key features

Robust D-loop detection: matches features named D_loop, D-loop, or control_region, as well as misc_feature entries whose note mentions a control region or D-loop.
FASTA + GFF3 support (auto-pair by basename via --gff-dir, or specify with --gff).
Heuristic fallback: extracts intergenic region between tRNA-Pro and tRNA-Phe when no explicit D-loop feature exists.
Circular-aware extraction: --circular allows intervals that wrap around the origin.
Policies: --fail-policy skip|empty|error and --multi best|keep for multiple candidates.
QC filters: --min-len / --max-len.
Outputs: per-record FASTA, optional combined FASTA, optional BED, and JSON sidecars (coords, strand, length, source).
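Circular-aware extraction amounts to joining the tail and the head of the sequence when an interval crosses the origin. The sketch below assumes 0-based, end-exclusive coordinates and assumes that a wrap is signalled by start > end; the function name extract_interval is hypothetical and the script's internal convention may differ.

```python
def extract_interval(seq, start, end, circular=False):
    """Slice seq[start:end], wrapping around the origin when start > end (hypothetical helper)."""
    if start <= end:
        return seq[start:end]
    if not circular:
        raise ValueError("interval wraps around the origin but circular=False")
    # Tail of the sequence followed by its head
    return seq[start:] + seq[:end]
```

Under these assumptions the tRNA-Pro to tRNA-Phe fallback would call something like extract_interval(genome, pro_end, phe_start, circular=True), which covers the common case where the genome record starts at tRNA-Phe and the control region spans the origin.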

Installation

Python 3.8+
Biopython

pip install biopython

Usage

Single GenBank

python extract_dloop.py genome.gb --out-dir results/dloop --circular

Directory with mixed inputs (recursive)

python extract_dloop.py data/ --out-dir results/dloop --circular

FASTA + GFF3 (explicit file)

python extract_dloop.py sample.fasta --gff sample.gff3 --out-dir results/dloop

FASTA + GFF3 (auto-match by basename in a directory)

python extract_dloop.py genomes/ --gff-dir annotations/ --out-dir results/dloop --circular

Combined FASTA and BED export

python extract_dloop.py genomes/   --out-dir out/dloop   --combine-out out/dloop_all.fasta   --bed out/dloop.bed   --circular

Options

inputs: one or more files or directories (recursive).
--out-dir: directory for per-record FASTA/JSON outputs (required).
--combine-out: optional path to a combined FASTA with all extracted D-loops.
--gff: path to a single GFF3 file (for a single FASTA).
--gff-dir: directory used to auto-match *.gff/*.gff3 by basename for FASTAs.
--circular: treat sequences as circular; allows wrap-around extraction.
--fail-policy {skip,empty,error}: behavior when D-loop is not found (default: skip).
--multi {best,keep}: keep all candidates (keep) or only the best (best, default).
--min-len / --max-len: filter sequences by length (0 disables).
--bed: optional BED file to which intervals are appended (wrap-around intervals are not split).
--log-level: logging level (INFO, DEBUG, etc.).
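Auto-matching a FASTA file to its annotation by basename, as --gff-dir does, might look like the sketch below. The function name match_gff and the preference for .gff3 over .gff when both exist are assumptions.

```python
from pathlib import Path


def match_gff(fasta_path, gff_dir):
    """Return the GFF file in gff_dir sharing the FASTA basename, or None (hypothetical helper)."""
    stem = Path(fasta_path).stem  # "genomes/sample.fasta" -> "sample"
    for ext in (".gff3", ".gff"):
        candidate = Path(gff_dir) / (stem + ext)
        if candidate.is_file():
            return candidate
    return None
```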

Output

out_dir/<record_id>_dloop.fasta (or _dloop_heur.fasta when heuristic is used)
out_dir/<record_id>_dloop.json with metadata
--combine-out → single FASTA with all extracted D-loops
--bed → BED file with intervals

FASTA header example:

>NC_XXXX | dloop | coords=15432..16569 | strand=+ | source=genome.gb
AACCTTG...
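Headers in this layout are easy to parse downstream. The following sketch assumes the pipe-separated layout shown in the example (record id, label, then key=value fields); the function name parse_header and the returned key names record_id and label are hypothetical.

```python
def parse_header(line):
    """Split a '>id | label | key=value | ...' header into a dict (hypothetical helper)."""
    fields = [field.strip() for field in line.lstrip(">").split("|")]
    info = {"record_id": fields[0], "label": fields[1]}
    for field in fields[2:]:
        key, _, value = field.partition("=")
        info[key] = value
    if "coords" in info:
        # "15432..16569" -> (15432, 16569)
        start, _, end = info["coords"].partition("..")
        info["coords"] = (int(start), int(end))
    return info
```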

JSON sidecar example:

{
  "record_id": "NC_XXXX",
  "source": "genome.gb",
  "coords": [15432, 16569],
  "strand": "+",
  "length": 1138,
  "rank": 1,
  "type": "control_region",
  "note": "putative control region"
}

Notes & caveats

For FASTA without annotations, use --gff/--gff-dir or rely on the Pro↔Phe heuristic.
BED cannot represent wrap-around intervals in a single line; the script writes the raw [start, end) as-is.
The current GFF3 parser is minimal and focused on D-loop/tRNA features.
For production-grade GFF handling (phase, attributes), consider gffutils.
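If a downstream tool needs standard BED intervals, a wrap-around interval can be split into two lines after the fact. This is a post-processing sketch, not something the script does: it assumes 0-based, end-exclusive coordinates with start > end marking a wrap, and the function name bed_lines is hypothetical.

```python
def bed_lines(chrom, start, end, seq_len, name="dloop", strand="+"):
    """Return one BED6 line, or two when the interval crosses the origin (hypothetical helper)."""
    if start <= end:
        parts = [(start, end)]
    else:
        # Split at the origin: tail piece first, then head piece
        parts = [(start, seq_len), (0, end)]
    parts = [(s, e) for s, e in parts if e > s]  # drop empty pieces
    return [
        "\t".join([chrom, str(s), str(e), name, "0", strand])
        for s, e in parts
    ]
```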

Citation

If this tool contributes to your research, please cite this repository and acknowledge Biopython.

extract_pcgs.py

extract_pcgs.py extracts mitochondrial protein-coding genes (PCGs) from GenBank or FASTA+GFF3 inputs.
It works on a single file or recursively over directories, and can emit DNA or translated proteins.
It also supports concatenating all PCGs per record.

Key features

CDS-aware extraction from GenBank (features of type CDS) or GFF3 (entries with type=CDS).
Groups multiple CDS by Parent/ID (GFF3) or by gene/locus_tag (GenBank).
Per-gene FASTA output (default) or a single concatenated sequence (--mode concat) in genomic order.
Optional protein translation (--protein) with configurable NCBI translation table (--transl-table, default 2: Vertebrate Mitochondrial).
Circular-aware extraction (wrap-around exons).
Optional combined FASTA, BED with exon intervals, and JSON sidecars with metadata.
Filters by length (--min-len, --max-len).

Installation

Python 3.8+
Biopython

pip install biopython

Usage

Single GenBank

python extract_pcgs.py genome.gb --out-dir results/pcgs --protein --transl-table 2

Directory with mixed inputs (recursive)

python extract_pcgs.py data/ --out-dir results/pcgs

FASTA + GFF3 (explicit file)

python extract_pcgs.py sample.fasta --gff sample.gff3 --out-dir results/pcgs --mode per-gene

FASTA + GFF3 (auto-match by basename in a directory)

python extract_pcgs.py genomes/ --gff-dir annotations/ --out-dir results/pcgs --mode concat --protein

Combined FASTA and BED export

python extract_pcgs.py genomes/   --out-dir out/pcgs   --combine-out out/pcgs_all.fasta   --bed out/pcgs_exons.bed   --protein --transl-table 2

Options

inputs: one or more files or directories (recursive).
--out-dir: directory for per-gene FASTA/JSON (required).
--combine-out: optional path to a combined FASTA.
--gff: path to a single GFF3 file (for a single FASTA).
--gff-dir: directory used to auto-match *.gff/*.gff3 by basename.
--circular: treat sequences as circular (wrap-around exons).
--protein: output proteins (AA) instead of DNA.
--transl-table: NCBI translation table (default 2 = Vertebrate Mitochondrial).
--mode {per-gene,concat}: emit per gene (default) or concatenated PCGs per record.
--min-len / --max-len: filter sequences by length (0 disables).
--bed: optional BED file with exon intervals.
--log-level: logging level.
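The "0 disables" behaviour of --min-len and --max-len can be expressed as a small predicate. The sketch assumes both bounds are inclusive, which the options list does not state; the function name passes_length_filter is hypothetical.

```python
def passes_length_filter(length, min_len=0, max_len=0):
    """Return True when length is within the bounds; a bound of 0 is ignored (hypothetical helper)."""
    if min_len and length < min_len:
        return False
    if max_len and length > max_len:
        return False
    return True
```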

Output

out_dir/<record_id>_<gene_id>.fna (DNA) or .faa (protein).
out_dir/<record_id>_<gene_id>.json metadata (coords, strand, exon list, product).
--mode concat → <record_id>_PCGs_concat.fna/.faa + .json.
--combine-out → combined FASTA with all outputs.
--bed → BED file with exon intervals (0-based, end-exclusive).

FASTA header example:

>NC_XXXX | ND2 | product=NADH dehydrogenase subunit 2 | exons=1 | source=genome.gb
ATG...

JSON sidecar example:

{
  "record_id": "NC_XXXX",
  "source": "genome.gb",
  "gene_id": "ND2",
  "product": "NADH dehydrogenase subunit 2",
  "protein": false,
  "transl_table": 2,
  "length": 1047,
  "exons": [[4586, 5633, "+"]]
}
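A sidecar like the one above can be sanity-checked by comparing the summed exon lengths with the reported length. The sketch assumes exon coordinates are 0-based and end-exclusive, which agrees with the example (5633 - 4586 = 1047); the function name exon_length is hypothetical.

```python
import json

sidecar = json.loads("""
{
  "record_id": "NC_XXXX",
  "gene_id": "ND2",
  "protein": false,
  "transl_table": 2,
  "length": 1047,
  "exons": [[4586, 5633, "+"]]
}
""")


def exon_length(exons):
    """Sum end - start over all exons (0-based, end-exclusive; hypothetical helper)."""
    return sum(end - start for start, end, _strand in exons)


total = exon_length(sidecar["exons"])
print(total == sidecar["length"])  # lengths agree for this example
print(total % 3 == 0)              # DNA length is a whole number of codons
```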

Notes & caveats

The GFF3 parser is minimal, focused on CDS records. For complex eukaryotic models, consider gffutils.
Translation stops at the first stop codon (to_stop=True). For quality control, inspect reading frames and partial CDS.
Mitogenomes typically have single-exon CDS, but the code supports multi-exon models and strand handling.
For concatenation, genes are ordered by the start of their first exon.
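Strand handling and the concatenation order described above can be sketched as follows. The names revcomp, gene_sequence, and concat_pcgs are hypothetical, exons are assumed to be 0-based, end-exclusive tuples listed in transcript order, and only unambiguous bases are complemented.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def revcomp(seq):
    """Reverse-complement a DNA string (unambiguous bases only)."""
    return seq.translate(COMPLEMENT)[::-1]


def gene_sequence(genome, exons):
    """Join exon slices, reverse-complementing minus-strand pieces."""
    parts = []
    for start, end, strand in exons:
        piece = genome[start:end]
        parts.append(revcomp(piece) if strand == "-" else piece)
    return "".join(parts)


def concat_pcgs(genome, genes):
    """Order genes by the start of their first exon, then concatenate."""
    ordered = sorted(genes.items(), key=lambda item: item[1][0][0])
    names = [gene_id for gene_id, _ in ordered]
    sequence = "".join(gene_sequence(genome, exons) for _, exons in ordered)
    return names, sequence
```

For instance, with genome "ATGCCCGGGTTA" and genes B at (6, 9, "+"), A at (0, 3, "+"), and C at (9, 12, "-"), the order is A, B, C and the concatenated sequence is "ATGGGGTAA".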

Citation

If this tool contributes to your research, please cite this repository and acknowledge Biopython.

extract_rrna.py

extract_rrna.py extracts mitochondrial rRNAs (12S and 16S) from GenBank or FASTA+GFF3 inputs.
It works on a single file or recursively over directories, includes robust matching logic, and supports circular genomes.

Key features

Detects rRNAs from GenBank (feature.type == 'rRNA') or GFF3 (type=rRNA).
Robust matching by product/note/gene/Name/ID containing 12S or 16S (also accepts common synonyms such as ssu/lsu, rrnS/rrnL).
Circular-aware extraction (wrap-around intervals).
Fail policies: --fail-policy skip|empty|error.
Optional fallback by FASTA header (--fasta-header-fallback) when no features exist.
QC filters: --min-len / --max-len.
Outputs: per-rRNA FASTA, optional combined FASTA, optional BED, and JSON sidecars with metadata.
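The matching logic can be approximated with two regular expressions over the annotation fields. This is only a sketch: the function name classify_rrna, the field priority order, and the exact patterns are assumptions, and the synonym list is limited to the ones named above.

```python
import re

# Synonyms taken from the feature list; the exact regexes are an assumption
SYNONYMS = {
    "12S": re.compile(r"12S|\bssu\b|\brrnS\b", re.IGNORECASE),
    "16S": re.compile(r"16S|\blsu\b|\brrnL\b", re.IGNORECASE),
}


def classify_rrna(qualifiers):
    """Return '12S', '16S', or None from a dict of annotation fields (hypothetical helper)."""
    for field in ("product", "note", "gene", "Name", "ID"):
        text = qualifiers.get(field, "")
        for label, pattern in SYNONYMS.items():
            if pattern.search(text):
                return label
    return None
```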

Installation

Python 3.8+
Biopython

pip install biopython

Usage

Single GenBank

python extract_rrna.py genome.gb --out-dir results/rrna

Directory with mixed inputs (recursive)

python extract_rrna.py data/ --out-dir results/rrna --circular

FASTA + GFF3 (explicit file)

python extract_rrna.py sample.fasta --gff sample.gff3 --out-dir results/rrna

FASTA + GFF3 (auto-match by basename in a directory)

python extract_rrna.py genomes/ --gff-dir annotations/ --out-dir results/rrna --circular

Combined FASTA and BED export

python extract_rrna.py genomes/   --out-dir out/rrna   --combine-out out/rrna_all.fasta   --bed out/rrna_regions.bed   --circular

Options

inputs: one or more files or directories (recursive).
--out-dir: directory for per-record FASTA/JSON outputs (required).
--combine-out: optional path to a combined FASTA.
--gff: path to a single GFF3 file (for a single FASTA).
--gff-dir: directory used to auto-match *.gff/*.gff3 by basename.
--circular: treat sequences as circular; allows wrap-around extraction.
--only {12S,16S}: restrict extraction to one rRNA.
--min-len / --max-len: filter sequences by length (0 disables).
--bed: optional BED file for intervals.
--fasta-header-fallback: if no features, try to detect 12S/16S from FASTA headers (use with caution).
--fail-policy {skip,empty,error}: behavior when targets are missing.
--log-level: logging level.

Output

out_dir/<record_id>_12S.fna (or _16S.fna), per record.
out_dir/<record_id>_<tag>.json sidecar with metadata (coords, strand, length, label, notes).
--combine-out → combined FASTA.
--bed → BED file with intervals (0-based, end-exclusive).

FASTA header example:

>NC_XXXX | 12S | coords=1481..2463 | strand=+ | source=genome.gb
ATG...

JSON sidecar example:

{
  "record_id": "NC_XXXX",
  "source": "genome.gb",
  "coords": [1481, 2463],
  "strand": "+",
  "length": 983,
  "label": "12S",
  "type": "rRNA",
  "note": "12S ribosomal RNA"
}

Notes & caveats

The GFF3 parser is minimal, focused on rRNA records and common attributes.
FASTA header fallback is best-effort; prefer annotated inputs (GenBank or GFF3).
BED cannot represent a wrap-around interval in a single line; the script writes the raw [start, end) as-is.

Citation

If this tool contributes to your research, please cite this repository and acknowledge Biopython.

extract_trna.py

extract_trna.py extracts mitochondrial tRNAs from GenBank or FASTA+GFF3 inputs.
It works on a single file or recursively over directories, labels each tRNA (e.g., tRNA-Phe/trnF), and is aware of circular genomes.

Key features

Detects tRNAs from GenBank (feature.type == 'tRNA') or GFF3 (type=tRNA).
Robust labeling from product / note / gene / Name / ID (supports formats tRNA-Phe, trnF, (Phe)).
Circular-aware extraction (wrap-around intervals).
Fail policies: --fail-policy skip|empty|error.
QC filters: --min-len / --max-len.
Outputs: per-tRNA FASTA, optional combined FASTA, optional BED, and JSON sidecars with metadata.
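Label inference across the three naming styles can be sketched with a lookup table and two regular expressions. The function name normalize_trna and the patterns are assumptions; isotype suffixes such as the 1/2 in trnL1 or trnS2 are dropped in this simplified version.

```python
import re

THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
}
ONE_TO_THREE = {one: three for three, one in THREE_TO_ONE.items()}


def normalize_trna(text):
    """Map 'tRNA-Phe', 'trnF', or '(Phe)' style text to 'tRNA-Phe' (hypothetical helper)."""
    # Three-letter forms: tRNA-Phe, tRNA_Phe, tRNA Phe, or (Phe)
    match = re.search(r"tRNA[-_ ]?([A-Za-z]{3})\b|\(([A-Za-z]{3})\)", text)
    if match:
        amino = (match.group(1) or match.group(2)).capitalize()
        if amino in THREE_TO_ONE:
            return f"tRNA-{amino}"
    # One-letter gene symbols: trnF, trnP, ...
    match = re.search(r"\btrn([A-Z])", text)
    if match and match.group(1) in ONE_TO_THREE:
        return f"tRNA-{ONE_TO_THREE[match.group(1)]}"
    return None
```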

Installation

Python 3.8+
Biopython

pip install biopython

Usage

Single GenBank

python extract_trna.py genome.gb --out-dir results/trna

Directory with mixed inputs (recursive)

python extract_trna.py data/ --out-dir results/trna --circular

FASTA + GFF3 (explicit file)

python extract_trna.py sample.fasta --gff sample.gff3 --out-dir results/trna

FASTA + GFF3 (auto-match by basename)

python extract_trna.py genomes/ --gff-dir annotations/ --out-dir results/trna --circular

Combined FASTA and BED export

python extract_trna.py genomes/   --out-dir out/trna   --combine-out out/trna_all.fasta   --bed out/trna_regions.bed   --circular

Options

inputs: one or more files or directories (recursive).
--out-dir: directory for per-record FASTA/JSON outputs (required).
--combine-out: optional path to a combined FASTA.
--gff: path to a single GFF3 file (for a single FASTA).
--gff-dir: directory used to auto-match *.gff/*.gff3 by basename.
--circular: treat sequences as circular; allows wrap-around extraction.
--only: restrict extraction to a specific tRNA label (e.g., Phe, Pro, tRNA-Phe, trnF).
--min-len / --max-len: filter sequences by length (0 disables).
--bed: optional BED file for intervals.
--fail-policy {skip,empty,error}: behavior when nothing is found.
--log-level: logging level.

Output

out_dir/<record_id>_tRNA-XXX.fna (one file per tRNA).
out_dir/<record_id>_tRNA-XXX.json sidecar with metadata (coords, strand, label).
--combine-out → combined FASTA.
--bed → BED file with intervals (0-based, end-exclusive).

FASTA header example:

>NC_XXXX | tRNA-Phe | coords=521..589 | strand=+ | source=genome.gb

JSON sidecar example:

{
  "record_id": "NC_XXXX",
  "source": "genome.gb",
  "coords": [521, 589],
  "strand": "+",
  "length": 69,
  "label": "tRNA-Phe",
  "type": "tRNA",
  "note": "tRNA-Phe (trnF)"
}

Notes & caveats

The GFF3 parser is minimal, focused on tRNA records and common attributes.
Label inference uses heuristics; verify naming conventions in heterogeneous annotations.
BED cannot represent wrap-around intervals in a single line; the script writes raw [start,end).

Citation

If this tool contributes to your research, please cite this repository and acknowledge Biopython.
