A Julia package for phylogenetic analysis and cladistics - the study of evolutionary relationships among organisms.
Cladistics is a method of biological classification that groups organisms based on their evolutionary ancestry and shared derived characteristics (synapomorphies). This package provides computational tools for reconstructing and analyzing phylogenetic trees from molecular sequence data and morphological characters.
Cladistics revolutionized biological classification by emphasizing evolutionary relationships over superficial similarities. Key concepts:
- Clade: A monophyletic group consisting of an ancestor and all its descendants
- Synapomorphy: A shared derived characteristic that defines a clade
- Phylogenetic Tree: A branching diagram showing evolutionary relationships
- Parsimony: The principle that the simplest evolutionary explanation (fewest changes) is preferred
- Bootstrap Analysis: Statistical method to assess confidence in tree topology
- Multiple Distance Metrics:
- Hamming distance (simple count of differences)
- p-distance (proportion of differences)
- Jukes-Cantor 1969 (corrects for multiple substitutions)
- Kimura 2-parameter (distinguishes transitions vs transversions)
- UPGMA: Unweighted Pair Group Method with Arithmetic Mean (assumes molecular clock)
- Neighbor-Joining: Does not assume molecular clock, handles rate variation
- Maximum Parsimony: Find trees requiring fewest evolutionary changes
- Fitch Algorithm: Efficient parsimony score calculation
- Parsimony-Informative Sites: Identify characters useful for phylogenetic inference
- Bootstrap Support: Assess confidence in tree topology (1000+ replicates)
- Clade Identification: Extract well-supported monophyletic groups
- Robinson-Foulds Distance: Compare tree topologies quantitatively
- Tree Rooting: Root unrooted trees using outgroup taxa
- Newick Format: Export trees in standard phylogenetic format
using Pkg
Pkg.add("Cladistics")using Cladistics
using Plots
# 1. DNA sequence data (aligned)
sequences = [
"ATCGATCGATCG", # Species A
"ATCGATCGATCG", # Species B (same as A)
"ATCGTTCGATCG", # Species C
"TTCGATCGATCG", # Species D
"TTCGTTCGATCC" # Species E (outgroup)
]
taxa_names = ["Species_A", "Species_B", "Species_C", "Species_D", "Species_E"]
# 2. Calculate evolutionary distances
dmat = distance_matrix(sequences, method=:jc69)
println("Distance matrix:")
display(dmat)
# 3. Build phylogenetic tree using UPGMA
upgma_tree = upgma(dmat, taxa_names=taxa_names)
newick = tree_to_newick(upgma_tree)
println("\nUPGMA tree (Newick): $newick")
# 4. Build tree using Neighbor-Joining
nj_tree = neighbor_joining(dmat, taxa_names=taxa_names)
# 5. Compare tree topologies
rf_distance = tree_distance(upgma_tree, nj_tree)
println("Robinson-Foulds distance: $rf_distance")
# 6. Bootstrap analysis for confidence
support = bootstrap_support(sequences, replicates=1000, method=:nj)
println("\nBootstrap support values:")
for (clade, value) in support
println(" Clade $(collect(clade)): $(round(value*100, digits=1))%")
end
# 7. Identify well-supported clades
clades = identify_clades(upgma_tree, 0.95)
println("\nWell-supported clades (>95%):")
for clade in clades
println(" ", clade)
endDifferent evolutionary models for different scenarios:
using Cladistics
sequences = ["ATCG", "ATCG", "TTCG", "TTCC"]
# Simple counting (good for very similar sequences)
hamming = distance_matrix(sequences, method=:hamming)
# Proportion of differences (normalized)
p_dist = distance_matrix(sequences, method=:p_distance)
# Jukes-Cantor (corrects for multiple substitutions at same site)
jc69 = distance_matrix(sequences, method=:jc69)
# Kimura 2-parameter (accounts for transition/transversion bias)
k2p = distance_matrix(sequences, method=:k2p)
println("Hamming distance [1,3]: ", hamming[1,3])
println("p-distance [1,3]: ", p_dist[1,3])
println("Jukes-Cantor [1,3]: ", jc69[1,3])
println("Kimura 2P [1,3]: ", k2p[1,3])using Cladistics
# UPGMA assumes constant evolutionary rate (molecular clock)
# Produces ultrametric tree (all tips equidistant from root)
dmat = [0.0 0.2 0.4;
0.2 0.0 0.3;
0.4 0.3 0.0]
tree = upgma(dmat, taxa_names=["Human", "Chimp", "Gorilla"])
# Good for:
# - Species with similar generation times
# - Recent divergences
# - Molecular clock hypothesis valid
# Not good for:
# - Ancient divergences
# - Variable evolutionary rates
# - Rapid radiationsusing Cladistics
# NJ handles rate variation across lineages
# More sophisticated than UPGMA
dmat = [0.0 0.5 0.8 1.0;
0.5 0.0 0.6 0.9;
0.8 0.6 0.0 0.4;
1.0 0.9 0.4 0.0]
tree = neighbor_joining(dmat, taxa_names=["A", "B", "C", "D"])
# Good for:
# - Variable evolutionary rates
# - Large datasets
# - When molecular clock violated
# Produces minimum evolution treeAssess confidence in your phylogenetic tree:
using Cladistics
using Random
Random.seed!(42) # Reproducible results
# Real mitochondrial DNA sequences (example)
sequences = [
"ATCGATCGATCGATCG",
"ATCGATCGATCGATCG", # Identical to first
"ATCGTTCGATCGATCG",
"TTCGATCGATCGATCG",
"TTCGTTCGATCGTTCC"
]
# Perform 1000 bootstrap replicates
# Resamples alignment columns with replacement
support = bootstrap_support(sequences, replicates=1000, method=:nj)
# Interpret support values:
# >95%: Strong support (publish with confidence)
# 70-95%: Moderate support (mention uncertainty)
# <70%: Weak support (tree topology unreliable)
for (clade, value) in sort(collect(support), by=x->x[2], rev=true)
taxa = collect(clade)
percent = round(value * 100, digits=1)
if percent >= 95
confidence = "Strong"
elseif percent >= 70
confidence = "Moderate"
else
confidence = "Weak"
end
println("Clade $taxa: $percent% ($confidence)")
endFind trees requiring the fewest evolutionary changes:
using Cladistics
# Aligned DNA sequences
alignment = [
"ATCG",
"ATCG",
"TTCG",
"TTCC"
]
# Convert to character matrix
char_matrix = character_state_matrix(alignment)
# Find parsimony-informative sites
# (sites where at least 2 states each appear in ≥2 taxa)
informative_sites = parsimony_informative_sites(char_matrix)
println("Parsimony-informative sites: $informative_sites")
# Build tree
dmat = distance_matrix(alignment, method=:hamming)
tree = upgma(dmat, taxa_names=["Tax1", "Tax2", "Tax3", "Tax4"])
# Calculate parsimony score (total number of changes)
score = calculate_parsimony_score(tree, char_matrix)
println("Parsimony score: $score changes")
# Lower scores = more parsimonious (preferred)using Cladistics
# Partial mitochondrial cytochrome b sequences (simplified)
primate_sequences = [
"ATCGATCGATCGATCGATCG", # Human
"ATCGATCGATCGATCGATCG", # Chimp (very similar to human)
"ATCGATCGTTCGATCGATCG", # Gorilla
"ATCGTTCGTTCGATCGTTCG", # Orangutan
"TTCGTTCGTTCGTTCGTTCG" # Lemur (outgroup)
]
taxa = ["Human", "Chimp", "Gorilla", "Orangutan", "Lemur"]
# Use Kimura 2-parameter (best for DNA)
dmat = distance_matrix(primate_sequences, method=:k2p)
# Build NJ tree (no molecular clock assumption)
tree = neighbor_joining(dmat, taxa_names=taxa)
# Root on lemur (outgroup)
rooted_tree = root_tree(tree, "Lemur")
# Bootstrap confidence
support = bootstrap_support(primate_sequences, replicates=100, method=:nj)
# Export for visualization
newick = tree_to_newick(rooted_tree)
println("Newick format: $newick")
# Can import into FigTree, iTOL, or other phylogenetic viewersusing Cladistics
# Morphological character matrix
# Characters: [wings, feathers, warm-blooded, lays eggs]
# 0 = absent, 1 = present
taxa_morphology = [
"AAAA", # Lizard (outgroup)
"AABB", # Crocodile
"BBBB", # Chicken
"BBBB", # Eagle
"BAAB" # Bat
]
taxa_names = ["Lizard", "Crocodile", "Chicken", "Eagle", "Bat"]
# Build tree based on characters
dmat = distance_matrix(taxa_morphology, method=:hamming)
tree = upgma(dmat, taxa_names=taxa_names)
# Analyze character evolution
char_matrix = character_state_matrix(taxa_morphology)
parsimony_score = calculate_parsimony_score(tree, char_matrix)
println("Minimum evolutionary changes: $parsimony_score")
# Low score = good fit between characters and tree topologyusing Cladistics
sequences = ["ATCG", "ATCG", "TTCG", "GTCC"]
# Hypothesis 1: UPGMA tree (assumes molecular clock)
dmat = distance_matrix(sequences, method=:jc69)
tree1 = upgma(dmat, taxa_names=["A", "B", "C", "D"])
# Hypothesis 2: Neighbor-Joining tree (no clock)
tree2 = neighbor_joining(dmat, taxa_names=["A", "B", "C", "D"])
# Compare topologies
rf_dist = tree_distance(tree1, tree2)
if rf_dist == 0
println("Trees have identical topology")
else
println("Trees differ by $rf_dist bipartitions")
end
# Calculate parsimony scores for both
char_matrix = character_state_matrix(sequences)
score1 = calculate_parsimony_score(tree1, char_matrix)
score2 = calculate_parsimony_score(tree2, char_matrix)
println("UPGMA parsimony score: $score1")
println("NJ parsimony score: $score2")
# Prefer tree with lower parsimony score (fewer changes)- Assumption: Evolutionary rate is constant across lineages
- When valid: Recent species, similar generation times
- Methods: UPGMA assumes molecular clock
- When violated: Use Neighbor-Joining instead
- Purpose: Assess confidence in tree branches
- Method: Resample alignment columns with replacement
- Interpretation:
-
95%: Publish with confidence
- 70-95%: Mention uncertainty
- <70%: Weak support, collect more data
-
- Parsimony: Find tree requiring fewest character changes
- Good: Morphological data, theoretical clarity
- Bad: Computationally expensive (NP-hard)
- Distance: Build tree from pairwise distances
- Good: Fast, scales to large datasets
- Bad: Information loss from pairwise comparisons
- Felsenstein, J. (1985). "Confidence limits on phylogenies: An approach using the bootstrap." Evolution, 39(4), 783-791.
- Saitou, N., & Nei, M. (1987). "The neighbor-joining method: A new method for reconstructing phylogenetic trees." Molecular Biology and Evolution, 4(4), 406-425.
- Fitch, W. M. (1971). "Toward defining the course of evolution: Minimum change for a specific tree topology." Systematic Zoology, 20(4), 406-416.
- Felsenstein, J. (2004). Inferring Phylogenies. Sinauer Associates.
- Lemey, P., Salemi, M., & Vandamme, A. M. (2009). The Phylogenetic Handbook: A Practical Approach to Phylogenetic Analysis and Hypothesis Testing. Cambridge University Press.
- Hall, B. G. (2011). Phylogenetic Trees Made Easy: A How-To Manual. Sinauer Associates.
- Jukes, T. H., & Cantor, C. R. (1969). "Evolution of protein molecules." In Mammalian Protein Metabolism, pp. 21-132.
- Kimura, M. (1980). "A simple method for estimating evolutionary rates of base substitutions." Journal of Molecular Evolution, 16(2), 111-120.
If you use this package in research, please cite:
@software{cladistics_jl,
author = {Jewell, Jonathan D.A.},
title = {Cladistics.jl: Phylogenetic Analysis in Julia},
year = {2026},
url = {https://github.com/hyperpolymath/Cladistics.jl}
}- BioJulia - Broader bioinformatics ecosystem
- PhyloNetworks.jl - Phylogenetic networks
- Phylo.jl - Alternative phylogenetics package
Visualize Newick trees with:
See CONTRIBUTING.md for guidelines.
This project is licensed under the Palimpsest License (PMPL-1.0-or-later). See LICENSE for details.