mitOmics/mitoTools

mitoTools


calc_stats.py

calc_stats.py is a command-line tool for computing nucleotide composition statistics from mitochondrial or nuclear genomes.
It supports both FASTA and GenBank input, and can process single files or entire directories recursively.

Features

  • Counts A, C, G, T, ambiguous IUPAC bases, and N.
  • Calculates:
    • %A, %C, %G, %T, %N, %Ambiguities
    • %AT, %GC
    • AT-skew = (A−T)/(A+T)
    • GC-skew = (G−C)/(G+C)
  • Option to exclude ambiguous bases from the percentage denominator (--skip-ambig).
  • Supports FASTA (.fa, .fasta, .fna, .ffn) and GenBank (.gb, .gbk, .genbank).
  • Recursive directory search for batch processing.
  • Optional GC% sliding window profiles (--win, --step), saved as per-sequence CSV files.
  • Output as tab-delimited (TSV) table, easy to import into R, Python, Excel.
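
The statistics listed above can be sketched in a few lines of plain Python. This is an illustrative sketch, not the code in calc_stats.py: the function name composition is hypothetical, and it assumes that --skip-ambig removes both N and the other IUPAC ambiguity codes from the denominator.

```python
from collections import Counter

AMBIGUOUS = set("RYSWKMBDHV")  # IUPAC ambiguity codes other than N

def composition(seq, skip_ambig=False):
    """Return counts, AT/GC percentages and skews for one sequence string."""
    counts = Counter(seq.upper())
    a, c, g, t = (counts[b] for b in "ACGT")
    n = counts["N"]
    ambig = sum(counts[b] for b in AMBIGUOUS)
    # Assumption: with skip_ambig, N and ambiguity codes leave the denominator.
    denom = (a + c + g + t) if skip_ambig else (a + c + g + t + n + ambig)

    def pct(x):
        return 100.0 * x / denom if denom else 0.0

    def skew(x, y):
        # (x - y) / (x + y), defined as 0 when both counts are zero
        return (x - y) / (x + y) if (x + y) else 0.0

    return {
        "A": a, "C": c, "G": g, "T": t, "N": n, "ambiguous": ambig,
        "pct_AT": pct(a + t), "pct_GC": pct(g + c),
        "AT_skew": skew(a, t), "GC_skew": skew(g, c),
    }

stats = composition("AAAATTGGGCNR", skip_ambig=True)
print(stats["pct_AT"], stats["pct_GC"], stats["GC_skew"])  # 60.0 40.0 0.5
```

The skews depend only on the A/T and G/C counts, so they are unaffected by the choice of denominator.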

Installation

Requires Python 3.8+ and Biopython:

pip install biopython

Usage

Single FASTA file

python calc_stats.py genome.fa --out-tsv results/stats.tsv --skip-ambig

Directory with multiple FASTAs (recursive)

python calc_stats.py data/ --out-tsv results/all_stats.tsv --skip-ambig

With GC sliding windows

python calc_stats.py data/ --out-tsv results/stats.tsv --skip-ambig --win 200 --step 50

This creates additional files like seq1_gc_windows.csv with local GC% profiles.
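
For intuition, a sliding-window GC% profile can be computed as below. This is a minimal sketch under stated assumptions: the function name gc_windows is hypothetical, windows use 0-based end-exclusive coordinates, trailing partial windows are dropped, and the denominator is the window length. The script's own choices may differ.

```python
def gc_windows(seq, win, step):
    """Yield (start, end, gc_percent) for each full window; 0-based, end-exclusive."""
    seq = seq.upper()
    for start in range(0, len(seq) - win + 1, step):
        window = seq[start:start + win]
        gc = window.count("G") + window.count("C")
        yield start, start + win, 100.0 * gc / win

rows = list(gc_windows("GGCCAATTGGCC", win=4, step=4))
print(rows)  # [(0, 4, 100.0), (4, 8, 0.0), (8, 12, 100.0)]
```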

Output columns (TSV)

  • input_path: source file
  • record_id: sequence identifier
  • name: record id + description
  • length: sequence length
  • A, C, G, T, N, ambiguous: counts
  • pct_A, pct_C, pct_G, pct_T, pct_N, pct_ambiguous: percentages
  • pct_AT, pct_GC: AT/GC content
  • AT_skew, GC_skew: skew indices

Example output

input_path   record_id   name   length   A   C   G   T   N   ambiguous   pct_A   pct_C   pct_G   pct_T   pct_N   pct_ambiguous   pct_AT   pct_GC   AT_skew   GC_skew
genome.fa    seq1        seq1   16569    5100 2800 2700 4969 0   0   30.8   16.9   16.3   30.0   0.0   0.0   60.8   33.2   0.013   -0.018
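
Because the table is plain TSV, it can be read with the standard library alone. The sketch below parses a small in-memory sample that uses a subset of the column names listed above; load_stats is a hypothetical helper, not part of the tool.

```python
import csv
import io

# In-memory sample with a subset of the documented columns.
tsv_text = (
    "input_path\trecord_id\tlength\tpct_GC\tGC_skew\n"
    "genome.fa\tseq1\t16569\t33.2\t-0.018\n"
)

def load_stats(handle):
    """Read the TSV and convert the numeric columns used here."""
    rows = []
    for row in csv.DictReader(handle, delimiter="\t"):
        row["length"] = int(row["length"])
        row["pct_GC"] = float(row["pct_GC"])
        row["GC_skew"] = float(row["GC_skew"])
        rows.append(row)
    return rows

rows = load_stats(io.StringIO(tsv_text))
print(rows[0]["record_id"], rows[0]["length"], rows[0]["GC_skew"])
```

For a real run, pass an open file handle for the path given to --out-tsv instead of the StringIO object.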

Citation

If you use this script in research, please cite this repository and acknowledge Biopython.


extract_dloop.py

extract_dloop.py extracts the mitochondrial control region (D-loop) from GenBank or FASTA+GFF3 inputs.
It works on a single file or recursively over directories, supports circular genomes, and falls back to a tRNA-Pro/tRNA-Phe heuristic when no D-loop feature is annotated.


Key features

  • Robust detection of D-loop: matches D_loop, D-loop, control_region, or misc_feature with note mentioning control/D-loop.
  • FASTA + GFF3 support (auto-pair by basename via --gff-dir, or specify with --gff).
  • Heuristic fallback: extracts intergenic region between tRNA-Pro and tRNA-Phe when no explicit D-loop feature exists.
  • Circular-aware extraction: --circular allows intervals that wrap around the origin.
  • Policies: --fail-policy skip|empty|error and --multi best|keep for multiple candidates.
  • QC filters: --min-len / --max-len.
  • Outputs: per-record FASTA, optional combined FASTA, optional BED, and JSON sidecars (coords, strand, length, source).
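
The circular-aware extraction and the tRNA-Pro/tRNA-Phe fallback both reduce to slicing an interval that may cross the origin. A minimal sketch, assuming 0-based end-exclusive coordinates and a plain string genome; extract_interval and the toy coordinates are hypothetical, not taken from the script.

```python
def extract_interval(seq, start, end, circular=False):
    """Slice seq[start:end); if end < start, wrap past the origin (circular only)."""
    if start <= end:
        return seq[start:end]
    if not circular:
        raise ValueError("interval wraps the origin; pass circular=True")
    return seq[start:] + seq[:end]

genome = "ACGTACGTAA"
# Toy layout: tRNA-Pro ends at position 8 and tRNA-Phe starts at position 2,
# so the intergenic region between them crosses the origin.
region = extract_interval(genome, 8, 2, circular=True)
print(region)  # AAAC
```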

Installation

pip install biopython

Usage

Single GenBank

python extract_dloop.py genome.gb --out-dir results/dloop --circular

Directory with mixed inputs (recursive)

python extract_dloop.py data/ --out-dir results/dloop --circular

FASTA + GFF3 (explicit file)

python extract_dloop.py sample.fasta --gff sample.gff3 --out-dir results/dloop

FASTA + GFF3 (auto-match by basename in a directory)

python extract_dloop.py genomes/ --gff-dir annotations/ --out-dir results/dloop --circular

Combined FASTA and BED export

python extract_dloop.py genomes/   --out-dir out/dloop   --combine-out out/dloop_all.fasta   --bed out/dloop.bed   --circular

Options

  • inputs: one or more files or directories (recursive).
  • --out-dir: directory for per-record FASTA/JSON outputs (required).
  • --combine-out: optional path to a combined FASTA with all extracted D-loops.
  • --gff: path to a single GFF3 file (for a single FASTA).
  • --gff-dir: directory used to auto-match *.gff/*.gff3 by basename for FASTAs.
  • --circular: treat sequences as circular; allows wrap-around extraction.
  • --fail-policy {skip,empty,error}: behavior when D-loop is not found (default: skip).
  • --multi {best,keep}: keep all candidates (keep) or only the best (best, default).
  • --min-len / --max-len: filter sequences by length (0 disables).
  • --bed: optional BED file to which intervals are appended (wrap-around intervals are not split).
  • --log-level: logging level (INFO, DEBUG, etc.).
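
The basename matching behind --gff-dir can be pictured as follows. This is a sketch only: pair_by_basename is a hypothetical name, the FASTA extensions are assumed to be the ones calc_stats.py accepts, and the real script may resolve ties or nested directories differently.

```python
from pathlib import Path

FASTA_EXTS = {".fa", ".fasta", ".fna", ".ffn"}  # assumed extension set
GFF_EXTS = {".gff", ".gff3"}

def pair_by_basename(fasta_paths, gff_paths):
    """Map each FASTA path to the GFF path sharing its basename, or None."""
    gff_by_stem = {
        Path(p).stem: p for p in gff_paths if Path(p).suffix.lower() in GFF_EXTS
    }
    return {
        p: gff_by_stem.get(Path(p).stem)
        for p in fasta_paths
        if Path(p).suffix.lower() in FASTA_EXTS
    }

pairs = pair_by_basename(
    ["genomes/a.fasta", "genomes/b.fa"],
    ["annotations/a.gff3"],
)
print(pairs)  # {'genomes/a.fasta': 'annotations/a.gff3', 'genomes/b.fa': None}
```

A FASTA with no matching annotation maps to None here; in the tool, such records would presumably fall through to the heuristic or to the chosen --fail-policy.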

Output

  • out_dir/<record_id>_dloop.fasta (or _dloop_heur.fasta when heuristic is used)
  • out_dir/<record_id>_dloop.json with metadata
  • --combine-out → single FASTA with all extracted D-loops
  • --bed → BED file with intervals

FASTA header example:

>NC_XXXX | dloop | coords=15432..16569 | strand=+ | source=genome.gb
AACCTTG...

JSON sidecar example:

{
  "record_id": "NC_XXXX",
  "source": "genome.gb",
  "coords": [15432, 16569],
  "strand": "+",
  "length": 1138,
  "rank": 1,
  "type": "control_region",
  "note": "putative control region"
}
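
The example above is consistent with 1-based inclusive coordinates (16569 − 15432 + 1 = 1138). A quick sanity check on a sidecar could look like this; it is a sketch that assumes a non-wrapping interval, and sidecar_length_ok is a hypothetical helper.

```python
import json

sidecar_text = (
    '{"record_id": "NC_XXXX", "coords": [15432, 16569], '
    '"strand": "+", "length": 1138}'
)

def sidecar_length_ok(meta):
    """Check length against coords, assuming 1-based inclusive, non-wrapping."""
    start, end = meta["coords"]
    return end - start + 1 == meta["length"]

meta = json.loads(sidecar_text)
print(sidecar_length_ok(meta))  # True
```

For a wrap-around interval the check would also need the genome length, which is not among the sidecar fields shown here.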

Notes & caveats

  • For FASTA without annotations, use --gff/--gff-dir or rely on the Pro↔Phe heuristic.
  • BED cannot represent wrap-around intervals in a single line; the script writes the raw [start, end) as-is.
  • For production-grade GFF handling (phase, attributes), consider gffutils.
    The current parser is minimal and focused on D-loop/tRNA features.
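
If a downstream tool needs valid BED rows for a wrap-around interval, the raw interval can be split at the origin after the fact. The sketch below is a post-processing idea, not something the script does; bed_rows, the _part1/_part2 suffixes, and the example coordinates are assumptions.

```python
def bed_rows(chrom, start, end, name, strand, genome_len):
    """Return BED6 rows (0-based, end-exclusive), splitting at the origin if needed."""
    if start <= end:
        return [(chrom, start, end, name, 0, strand)]
    return [
        (chrom, start, genome_len, name + "_part1", 0, strand),
        (chrom, 0, end, name + "_part2", 0, strand),
    ]

rows = bed_rows("NC_XXXX", 16000, 300, "dloop", "+", 16569)
for row in rows:
    print("\t".join(str(field) for field in row))
```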

Citation

If this tool contributes to your research, please cite this repository and acknowledge Biopython.


extract_pcgs.py

extract_pcgs.py extracts mitochondrial protein-coding genes (PCGs) from GenBank or FASTA+GFF3 inputs.
It works on a single file or recursively over directories, and can emit DNA or translated proteins.
It also supports concatenating all PCGs per record.

Key features

  • CDS-aware extraction from GenBank (features of type CDS) or GFF3 (entries with type=CDS).
  • Groups multiple CDS by Parent/ID (GFF3) or by gene/locus_tag (GenBank).
  • Per-gene FASTA output (default) or a single concatenated sequence (--mode concat) in genomic order.
  • Optional protein translation (--protein) with configurable NCBI translation table (--transl-table, default 2: Vertebrate Mitochondrial).
  • Circular-aware extraction (wrap-around exons).
  • Optional combined FASTA, BED with exon intervals, and JSON sidecars with metadata.
  • Filters by length (--min-len, --max-len).
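
To illustrate what translation table 2 changes, here is a dependency-free sketch of translating up to the first stop codon. The tool itself relies on Biopython; this stand-in only mirrors the idea, and translate_to_stop is a hypothetical name. In table 2, ATA codes for Met, TGA for Trp, and AGA/AGG are stops.

```python
from itertools import product

BASES = "TCAG"
# NCBI translation table 2 (Vertebrate Mitochondrial), codons in TCAG order.
TABLE2_AA = "FFLLSSSSYY**CCWWLLLLPPPPHHQQRRRRIIMMTTTTNNKKSS**VVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), TABLE2_AA)
}

def translate_to_stop(dna):
    """Translate complete codons until the first stop; unknown codons become X."""
    dna = dna.upper()
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE.get(dna[i:i + 3], "X")
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)

print(translate_to_stop("ATGATATGAAGAGGG"))  # MMW
```

A trailing partial codon is ignored in this sketch, which matters for mitochondrial genes whose stop codon is incomplete in the genomic sequence.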

Installation

pip install biopython

Usage

Single GenBank

python extract_pcgs.py genome.gb --out-dir results/pcgs --protein --transl-table 2

Directory with mixed inputs (recursive)

python extract_pcgs.py data/ --out-dir results/pcgs

FASTA + GFF3 (explicit file)

python extract_pcgs.py sample.fasta --gff sample.gff3 --out-dir results/pcgs --mode per-gene

FASTA + GFF3 (auto-match by basename in a directory)

python extract_pcgs.py genomes/ --gff-dir annotations/ --out-dir results/pcgs --mode concat --protein

Combined FASTA and BED export

python extract_pcgs.py genomes/   --out-dir out/pcgs   --combine-out out/pcgs_all.fasta   --bed out/pcgs_exons.bed   --protein --transl-table 2

Options

  • inputs: one or more files or directories (recursive).
  • --out-dir: directory for per-gene FASTA/JSON (required).
  • --combine-out: optional path to a combined FASTA.
  • --gff: path to a single GFF3 file (for a single FASTA).
  • --gff-dir: directory used to auto-match *.gff/*.gff3 by basename.
  • --circular: treat sequences as circular (wrap-around exons).
  • --protein: output proteins (AA) instead of DNA.
  • --transl-table: NCBI translation table (default 2 = Vertebrate Mitochondrial).
  • --mode {per-gene,concat}: emit per gene (default) or concatenated PCGs per record.
  • --min-len / --max-len: filter sequences by length (0 disables).
  • --bed: optional BED file with exon intervals.
  • --log-level: logging level.

Output

  • out_dir/<record_id>_<gene_id>.fna (DNA) or .faa (protein).
  • out_dir/<record_id>_<gene_id>.json metadata (coords, strand, exon list, product).
  • --mode concat → <record_id>_PCGs_concat.fna/.faa + .json.
  • --combine-out → combined FASTA with all outputs.
  • --bed → BED file with exon intervals (0-based, end-exclusive).

FASTA header example:

>NC_XXXX | ND2 | product=NADH dehydrogenase subunit 2 | exons=1 | source=genome.gb
ATG...

JSON sidecar example:

{
  "record_id": "NC_XXXX",
  "source": "genome.gb",
  "gene_id": "ND2",
  "product": "NADH dehydrogenase subunit 2",
  "protein": false,
  "transl_table": 2,
  "length": 1047,
  "exons": [[4586, 5633, "+"]]
}

Notes & caveats

  • The GFF3 parser is minimal, focused on CDS records. For complex eukaryotic models, consider gffutils.
  • Translation stops at the first stop codon (to_stop=True). For quality control, inspect frames and partials.
  • Mitogenomes typically have single-exon CDS, but the code supports multi-exon models and strand handling.
  • For concatenation, genes are ordered by the start of their first exon.
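
The strand handling and concatenation order described in these notes can be sketched as follows. Assumptions: exons are (start, end, strand) tuples in 0-based end-exclusive coordinates, as in the JSON example, and "start of the first exon" is taken to be the smallest exon start. gene_sequence and concat_pcgs are hypothetical names.

```python
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")

def revcomp(seq):
    """Reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]

def gene_sequence(genome, exons):
    """Join exons in genomic order; reverse-complement minus-strand genes."""
    strand = exons[0][2]
    ordered = sorted(exons, key=lambda e: e[0])
    joined = "".join(genome[s:e] for s, e, _ in ordered)
    return revcomp(joined) if strand == "-" else joined

def concat_pcgs(genome, genes):
    """genes: dict gene_id -> exon list. Order genes by their smallest exon start."""
    order = sorted(genes, key=lambda g: min(e[0] for e in genes[g]))
    return order, "".join(gene_sequence(genome, genes[g]) for g in order)

genome = "AAACCCGGGTTT"
genes = {"geneB": [(6, 9, "+")], "geneA": [(0, 3, "-")]}
order, concat = concat_pcgs(genome, genes)
print(order, concat)  # ['geneA', 'geneB'] TTTGGG
```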

Citation

If this tool contributes to your research, please cite this repository and acknowledge Biopython.


extract_rrna.py

extract_rrna.py extracts mitochondrial rRNAs (12S and 16S) from GenBank or FASTA+GFF3 inputs.
It works on a single file or recursively over directories, includes robust matching logic, and supports circular genomes.

Key features

  • Detects rRNAs from GenBank (feature.type == 'rRNA') or GFF3 (type=rRNA).
  • Robust matching by product/note/gene/Name/ID containing 12S or 16S (also accepts common synonyms such as ssu/lsu, rrnS/rrnL).
  • Circular-aware extraction (wrap-around intervals).
  • Fail policies: --fail-policy skip|empty|error.
  • Optional fallback by FASTA header (--fasta-header-fallback) when no features exist.
  • QC filters: --min-len / --max-len.
  • Outputs: per-rRNA FASTA, optional combined FASTA, optional BED, and JSON sidecars with metadata.
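
The label matching described above might look roughly like this. It is a sketch limited to the synonyms named in the list (12S/16S, ssu/lsu, rrnS/rrnL); classify_rrna and the exact patterns are assumptions, and the script may recognize more variants.

```python
import re

PATTERNS = {
    "12S": re.compile(r"12S|\bSSU\b|\brrnS\b", re.IGNORECASE),
    "16S": re.compile(r"16S|\bLSU\b|\brrnL\b", re.IGNORECASE),
}
FIELDS = ("product", "note", "gene", "Name", "ID")

def classify_rrna(qualifiers):
    """Return '12S', '16S' or None from a dict of feature qualifiers."""
    text = " ".join(str(qualifiers.get(key, "")) for key in FIELDS)
    for label, pattern in PATTERNS.items():
        if pattern.search(text):
            return label
    return None

print(classify_rrna({"product": "12S ribosomal RNA"}))  # 12S
print(classify_rrna({"gene": "rrnL"}))                  # 16S
```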

Installation

pip install biopython

Usage

Single GenBank

python extract_rrna.py genome.gb --out-dir results/rrna

Directory with mixed inputs (recursive)

python extract_rrna.py data/ --out-dir results/rrna --circular

FASTA + GFF3 (explicit file)

python extract_rrna.py sample.fasta --gff sample.gff3 --out-dir results/rrna

FASTA + GFF3 (auto-match by basename in a directory)

python extract_rrna.py genomes/ --gff-dir annotations/ --out-dir results/rrna --circular

Combined FASTA and BED export

python extract_rrna.py genomes/   --out-dir out/rrna   --combine-out out/rrna_all.fasta   --bed out/rrna_regions.bed   --circular

Options

  • inputs: one or more files or directories (recursive).
  • --out-dir: directory for per-record FASTA/JSON outputs (required).
  • --combine-out: optional path to a combined FASTA.
  • --gff: path to a single GFF3 file (for a single FASTA).
  • --gff-dir: directory used to auto-match *.gff/*.gff3 by basename.
  • --circular: treat sequences as circular; allows wrap-around extraction.
  • --only {12S,16S}: restrict extraction to one rRNA.
  • --min-len / --max-len: filter sequences by length (0 disables).
  • --bed: optional BED file for intervals.
  • --fasta-header-fallback: if no features, try to detect 12S/16S from FASTA headers (use with caution).
  • --fail-policy {skip,empty,error}: behavior when targets are missing.
  • --log-level: logging level.

Output

  • out_dir/<record_id>_12S.fna (or _16S.fna), per record.
  • out_dir/<record_id>_<tag>.json sidecar with metadata (coords, strand, length, label, notes).
  • --combine-out → combined FASTA.
  • --bed → BED file with intervals (0-based, end-exclusive).

FASTA header example:

>NC_XXXX | 12S | coords=1481..2463 | strand=+ | source=genome.gb
ATG...

JSON sidecar example:

{
  "record_id": "NC_XXXX",
  "source": "genome.gb",
  "coords": [1481, 2463],
  "strand": "+",
  "length": 983,
  "label": "12S",
  "type": "rRNA",
  "note": "12S ribosomal RNA"
}

Notes & caveats

  • The GFF3 parser is minimal, focused on rRNA records and common attributes.
  • FASTA header fallback is best-effort; prefer annotated inputs (GenBank or GFF3).
  • BED cannot represent wrap-around intervals in a single line; the script writes the raw [start,end) as-is.

Citation

If this tool contributes to your research, please cite this repository and acknowledge Biopython.


extract_trna.py

extract_trna.py extracts mitochondrial tRNAs from GenBank or FASTA+GFF3 inputs.
It works on a single file or recursively over directories, labels each tRNA (e.g., tRNA-Phe/trnF), and is aware of circular genomes.

Key features

  • Detects tRNAs from GenBank (feature.type == 'tRNA') or GFF3 (type=tRNA).
  • Robust labeling from product / note / gene / Name / ID (supports formats tRNA-Phe, trnF, (Phe)).
  • Circular-aware extraction (wrap-around intervals).
  • Fail policies: --fail-policy skip|empty|error.
  • QC filters: --min-len / --max-len.
  • Outputs: per-tRNA FASTA, optional combined FASTA, optional BED, and JSON sidecars with metadata.
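
Label inference across the three formats mentioned above (tRNA-Phe, trnF, (Phe)) can be sketched with two regular expressions. normalize_trna is a hypothetical helper covering only the 20 standard amino acids; the script's own heuristics may be broader.

```python
import re

THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C", "Gln": "Q",
    "Glu": "E", "Gly": "G", "His": "H", "Ile": "I", "Leu": "L", "Lys": "K",
    "Met": "M", "Phe": "F", "Pro": "P", "Ser": "S", "Thr": "T", "Trp": "W",
    "Tyr": "Y", "Val": "V",
}
ONE_TO_THREE = {one: three for three, one in THREE_TO_ONE.items()}

def normalize_trna(text):
    """Map 'tRNA-Phe', 'trnF' or '(Phe)' to 'tRNA-Phe'; None if unrecognized."""
    # Three-letter form after 'tRNA' (optional separator) or after '('.
    m = re.search(r"(?:tRNA[-_ ]?|\()([A-Za-z]{3})\b", text)
    if m and m.group(1).capitalize() in THREE_TO_ONE:
        return "tRNA-" + m.group(1).capitalize()
    # One-letter gene-style form, e.g. trnF.
    m = re.search(r"\btrn([A-Z])", text)
    if m and m.group(1) in ONE_TO_THREE:
        return "tRNA-" + ONE_TO_THREE[m.group(1)]
    return None

for raw in ("tRNA-Phe", "trnF", "product (Pro)"):
    print(raw, "->", normalize_trna(raw))
```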

Installation

pip install biopython

Usage

Single GenBank

python extract_trna.py genome.gb --out-dir results/trna

Directory with mixed inputs (recursive)

python extract_trna.py data/ --out-dir results/trna --circular

FASTA + GFF3 (explicit file)

python extract_trna.py sample.fasta --gff sample.gff3 --out-dir results/trna

FASTA + GFF3 (auto-match by basename)

python extract_trna.py genomes/ --gff-dir annotations/ --out-dir results/trna --circular

Combined FASTA and BED export

python extract_trna.py genomes/   --out-dir out/trna   --combine-out out/trna_all.fasta   --bed out/trna_regions.bed   --circular

Options

  • inputs: one or more files or directories (recursive).
  • --out-dir: directory for per-record FASTA/JSON outputs (required).
  • --combine-out: optional path to a combined FASTA.
  • --gff: path to a single GFF3 file (for a single FASTA).
  • --gff-dir: directory used to auto-match *.gff/*.gff3 by basename.
  • --circular: treat sequences as circular; allows wrap-around extraction.
  • --only: restrict extraction to a specific tRNA label (e.g., Phe, Pro, tRNA-Phe, trnF).
  • --min-len / --max-len: filter sequences by length (0 disables).
  • --bed: optional BED file for intervals.
  • --fail-policy {skip,empty,error}: behavior when nothing is found.
  • --log-level: logging level.

Output

  • out_dir/<record_id>_tRNA-XXX.fna (one file per tRNA).
  • out_dir/<record_id>_tRNA-XXX.json sidecar with metadata (coords, strand, label).
  • --combine-out → combined FASTA.
  • --bed → BED file with intervals (0-based, end-exclusive).

FASTA header example:

>NC_XXXX | tRNA-Phe | coords=521..589 | strand=+ | source=genome.gb

JSON sidecar example:

{
  "record_id": "NC_XXXX",
  "source": "genome.gb",
  "coords": [521, 589],
  "strand": "+",
  "length": 69,
  "label": "tRNA-Phe",
  "type": "tRNA",
  "note": "tRNA-Phe (trnF)"
}

Notes & caveats

  • The GFF3 parser is minimal, focused on tRNA records and common attributes.
  • Label inference uses heuristics; verify naming conventions in heterogeneous annotations.
  • BED cannot represent wrap-around intervals in a single line; the script writes raw [start,end).

Citation

If this tool contributes to your research, please cite this repository and acknowledge Biopython.
