# Ray-finned fish tree from TimeTree - species level (Actinopterygii)

The goal of this notebook is to produce a clean species tree whose internal
nodes are annotated with order and family names. It also produces an order-level
tree, with the orders as the leaves, that will serve as a scaffold in successive
notebooks.

To accomplish this, we need to look up the order and family for every species in
the tree. This requires cleaning the tree: removing taxa that cannot be looked
up and fixing node names so that lookups succeed. It also requires cleaning the
database, which is an iterative process. This final version of the notebook
reflects the state after all of that has been done. The taxonomy database is
saved locally during this process, and the final database has been tweaked
quite a bit.

This notebook requires ETE (the `ete3` package), which normally does not install
properly on Windows. You can run it on Windows, but you have to install ETE
carefully and with the right prerequisites. You may have issues running a
blanket `import ete3`, as this pulls in all kinds of things that don't work on
Windows. Here I only use the `Tree` class, which appears to work.

In [1]:
# Let's use my handy dandy taxonomy database to look up order and family info.

import TaxonomyDB as taxonomydb
from importlib import reload
reload(taxonomydb)

# Get the hostname to help set the data directory correctly.
import socket
hostname = socket.gethostname()
print("Hostname:", hostname)

if hostname == "RGGS-LAP-005":
    db = taxonomydb.TaxonomyDB(data_dir="e:/NCBI_taxdump")
else:
    db = taxonomydb.TaxonomyDB(data_dir="c:/Users/hherhold/Data/NCBI_taxdump")


Hostname: AMNH-WSJ8DDFB4
Loading NCBI taxonomy data...done.


Quick database test...

In [2]:
db.get_species_info("Calamus")

Cache hit for genus 'Calamus': Order: Acanthuriformes, Family: Sparidae


('Acanthuriformes', 'Sparidae')

Database cleanup: fix the problematic taxa identified on previous runs. Each genus is removed and then re-added with the correct order and family. This only needs to be run once, because the DB caches changes locally. Running it multiple times doesn't hurt anything; it just isn't necessary.

In [3]:
# Extra taxa not in GBIF or NCBI that we need to handle.
db.remove_species_info(genus='Calamus')
db.remove_species_info(genus='Ophidion')
db.remove_species_info(genus='Liparis')
db.remove_species_info(genus='Orestias')
db.remove_species_info(genus='Scuticaria')
db.remove_species_info(genus='Centropogon')
db.remove_species_info(genus='Exostoma')
db.remove_species_info(genus='Mora')
db.remove_species_info(genus='Bullockia')
db.remove_species_info(genus='Symphysodon')
db.remove_species_info(genus='Scleronema')
db.remove_species_info(genus='Histrio')
db.remove_species_info(genus='Nansenia')
db.remove_species_info(genus='Olyra')
db.remove_species_info(genus='Cynodon')
db.remove_species_info(genus='Bairdiella')
db.remove_species_info(genus='Ranzania')
db.remove_species_info(genus='Zeus')
db.remove_species_info(genus='Argentina')
db.remove_species_info(genus='Dactylophora')
db.remove_species_info(genus='Trachystoma')

db.add_species_info('Calamus', 'Acanthuriformes', 'Sparidae')
db.add_species_info('Ophidion', 'Ophidiiformes', 'Ophidiidae')
db.add_species_info('Liparis', 'Perciformes', 'Liparidae')
db.add_species_info('Orestias', 'Cyprinodontiformes', 'Cyprinodontidae')
db.add_species_info('Scuticaria', 'Anguilliformes', 'Muraenidae')
db.add_species_info('Centropogon', 'Perciformes', 'Syngnathidae')
db.add_species_info('Exostoma', 'Siluriformes', 'Sisoridae')
db.add_species_info('Mora', 'Gadiformes', 'Moridae')
db.add_species_info('Bullockia', 'Siluriformes', 'Trichomycteridae')
db.add_species_info('Symphysodon', 'Cichliformes', 'Cichlidae')
db.add_species_info('Scleronema', 'Siluriformes', 'Trichomycteridae')
db.add_species_info('Histrio', 'Lophiiformes', 'Antennariidae')
db.add_species_info('Nansenia', 'Argentiniformes', 'Microstomatidae')
db.add_species_info('Olyra', 'Siluriformes', 'Bagridae')
db.add_species_info('Cynodon', 'Characiformes', 'Cynodontidae')
db.add_species_info('Bairdiella', 'Acanthuriformes', 'Sciaenidae')
db.add_species_info('Ranzania', 'Tetraodontiformes', 'Molidae')
db.add_species_info('Zeus', 'Zeiformes', 'Zeidae')
db.add_species_info('Argentina', 'Argentiniformes', 'Argentinidae')
db.add_species_info('Dactylophora', 'Centrarchiformes', 'Cheilodactylidae')
db.add_species_info('Trachystoma', 'Mugiliformes', 'Mugilidae')

Removed Calamus from the database.
Removed Ophidion from the database.
Removed Liparis from the database.
Removed Orestias from the database.
Removed Scuticaria from the database.
Removed Centropogon from the database.
Removed Exostoma from the database.
Removed Mora from the database.
Removed Bullockia from the database.
Removed Symphysodon from the database.
Removed Scleronema from the database.
Removed Histrio from the database.
Removed Nansenia from the database.
Removed Olyra from the database.
Removed Cynodon from the database.
Removed Bairdiella from the database.
Removed Ranzania from the database.
Removed Zeus from the database.
Removed Argentina from the database.
Removed Dactylophora from the database.
Removed Trachystoma from the database.
Added Calamus to the database with Order: Acanthuriformes, Family: Sparidae
Added Ophidion to the database with Order: Ophidiiformes, Family: Ophidiidae
Added Liparis to the database with Order: Perciformes, Family: Liparidae
Added Oresti

In [4]:
# Load in the trees.

from ete3 import Tree

species_tree = Tree('ray-finned fishes_species.nwk', format=1, quoted_node_names=True)

# Output the number of tips in each tree.
print(f'Species tree has {len(species_tree.get_leaves())} tips.')


Species tree has 15180 tips.


## Data cleaning

Trees from timetree.org are a composite of many trees from different studies. Some of these have odd naming schemes and don't conform to "Genus species" naming like we need here for lookups. This section of code cleans up names so that lookups for family and order actually work.

Every dataset is a little different, and the code blocks below are specific to *this* dataset. Cleanup of the insect tree was quite different, as there were many misnamed genera and even some bacteria included. Oh, and some fungi, and a couple of bivalves.
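
The naming rules used in this section can be sketched on plain strings, without a tree. This is a minimal illustration only: the helper name `classify_leaf_name` is hypothetical and not part of the notebook, and the sample labels are chosen to mirror the kinds of problems seen in this dataset.

```python
import re

def classify_leaf_name(name):
    """Return (genus, problem) for a raw tip label.

    problem is None when the label looks usable for a genus lookup.
    """
    # The genus is the first token, split on a space, underscore, or dash.
    genus = re.split(r"[ _-]", name)[0]
    # Labels that are only a number (optionally prefixed by an underscore).
    if re.match(r"^_?\d+$", name):
        return genus, "numeric label"
    # Family or subfamily names posing as genera.
    if genus.endswith(("idae", "inae")):
        return genus, "family or subfamily name"
    return genus, None

for label in ["Lutjanus_sp._2_PM-2004", "Hypostominae_sp", "_1895", "Mola mola"]:
    print(label, "->", classify_leaf_name(label))
```

Keeping the rules in one function like this makes it easy to test them on a handful of labels before touching the tree itself.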

In [5]:
# There are a few family names posing as genera. Remove anything that
# ends with "idae" or "inae".

import re

for leaf in species_tree.get_leaves():
    # Grab the first part of the name. It can be the first word separated by a space,
    # underscore, or dash.
    genus_name = re.split(r'[ _-]', leaf.name)[0]

    # If the genus name ends with "idae" or "inae", remove the leaf.
    if genus_name.endswith(("idae", "inae")):
        print(f"Removing family name posing as genus: {genus_name}")
        # Detach this exact leaf; searching again by name could match a different node.
        leaf.detach()


Removing family name posing as genus: Thaumatichthyidae
Removing family name posing as genus: Ereuniidae
Removing family name posing as genus: Anomalopidae
Removing family name posing as genus: Hypostominae
Removing family name posing as genus: Neoplecostominae
Removing family name posing as genus: Neoplecostominae


In [6]:
# Are there any leaves that are just a number, or just an underscore and a number?
for leaf in species_tree.get_leaves():
    if re.match(r'^_?\d+$', leaf.name):
        print(f"Removing invalid leaf name: {leaf.name}")
        # Detach this exact leaf; searching again by name could match a different node.
        leaf.detach()

Removing invalid leaf name: _1895


In [7]:
# Misspellings. 

for leaf in species_tree.get_leaves():
    # Labeoarbus -> Labeobarbus
    if leaf.name.startswith("Labeoarbus"):
        print(f"Correcting misspelled genus: {leaf.name} -> Labeobarbus")
        leaf.name = leaf.name.replace("Labeoarbus", "Labeobarbus")

    # Tariqlabeo -> Tariqilabeo
    if leaf.name.startswith("Tariqlabeo"):
        print(f"Correcting misspelled genus: {leaf.name} -> Tariqilabeo")
        leaf.name = leaf.name.replace("Tariqlabeo", "Tariqilabeo")


Correcting misspelled genus: Labeoarbus_bynni_occidentalis -> Labeobarbus
Correcting misspelled genus: Tariqlabeo_bicornis -> Tariqilabeo


In [8]:
# Inactive taxa.
# Coelonotus_leiaspis should be Microphis_leiaspis.
for leaf in species_tree.get_leaves():
    if leaf.name.startswith("Coelonotus_leiaspis"):
        print(f"Correcting inactive taxon: {leaf.name} -> Microphis_leiaspis")
        leaf.name = leaf.name.replace("Coelonotus_leiaspis", "Microphis_leiaspis")


Correcting inactive taxon: Coelonotus_leiaspis -> Microphis_leiaspis


In [11]:
# Let's make sure there are no duplicated names in the tree. Only the tips are checked;
# the internal nodes are not important for this.
# Print each duplicated name, if any, along with how many times it appears.
from collections import Counter

tip_names = [leaf.name for leaf in species_tree.iter_leaves()]
# Count every name in a single pass instead of calling list.count() once per tip.
name_counts = Counter(tip_names)
duplicated_names = {name for name, n in name_counts.items() if n > 1}
print("Duplicated names and their counts:")
for name in duplicated_names:
    print(f"{name}: {name_counts[name]}")

Duplicated names and their counts:
Tenebrosternarchus_preto: 2


In [12]:
# There's only one, Tenebrosternarchus_preto. Get the list of all leaves with that name.
tenebrosternarchus_leaves = species_tree.search_nodes(name="Tenebrosternarchus_preto")
print(f"Found {len(tenebrosternarchus_leaves)} leaves with the name 'Tenebrosternarchus_preto'.")

# Rename them to Tenebrosternarchus_preto_1 and Tenebrosternarchus_preto_2.
# Rename each matching leaf directly; searching again by name could return
# a different node than the one being visited, leaving one leaf unrenamed.
count = 1
for leaf in species_tree.get_leaves():
    if leaf.name == "Tenebrosternarchus_preto":
        new_name = f"Tenebrosternarchus_preto_{count}"
        print(f"Renaming duplicate species: {leaf.name} -> {new_name}")
        leaf.name = new_name
        count += 1

# Check for duplicates again.
tip_names = [leaf.name for leaf in species_tree.iter_leaves()]
duplicated_names = set([name for name in tip_names if tip_names.count(name) > 1])
print("Duplicated names after renaming:", duplicated_names)

Found 2 leaves with the name 'Tenebrosternarchus_preto'.
Renaming duplicate species: Tenebrosternarchus_preto -> Tenebrosternarchus_preto_1
Renaming duplicate species: Tenebrosternarchus_preto -> Tenebrosternarchus_preto_2
Duplicated names after renaming: set()


In [13]:
# A couple of names were quoted in the source tree because they contain spaces.
# Replace spaces with underscores, and remove commas, periods, parentheses, and colons.
name_cleanup = str.maketrans(" ", "_", ",.():")
for leaf in species_tree.get_leaves():
    cleaned_name = leaf.name.translate(name_cleanup)
    if cleaned_name != leaf.name:
        print(f"Cleaning leaf name: {leaf.name} -> {cleaned_name}")
        leaf.name = cleaned_name

Cleaning leaf name: Photopectoralis_cf._aureus_MPD-2011 -> Photopectoralis_cf_aureus_MPD-2011
Cleaning leaf name: Photopectoralis_sp._Okinawa/Taiwan -> Photopectoralis_sp_Okinawa/Taiwan
Cleaning leaf name: Leiognathus_sp._Fiji -> Leiognathus_sp_Fiji
Cleaning leaf name: Lutjanus_sp._2_PM-2004 -> Lutjanus_sp_2_PM-2004
Cleaning leaf name: Lutjanus_sp._1_PM-2004 -> Lutjanus_sp_1_PM-2004
Cleaning leaf name: Lutjanus_cf._apodus_MAR-2011 -> Lutjanus_cf_apodus_MAR-2011
Cleaning leaf name: Haemulon_sp._A_JJT-2012 -> Haemulon_sp_A_JJT-2012
Cleaning leaf name: Haemulon_sp._B_JJT-2012 -> Haemulon_sp_B_JJT-2012
Cleaning leaf name: Anisotremus_aff._interruptus_JT-2018 -> Anisotremus_aff_interruptus_JT-2018
Cleaning leaf name: Anisotremus_aff._scapularis_JT-2018 -> Anisotremus_aff_scapularis_JT-2018
Cleaning leaf name: Acanthopagrus_sp._KPM-NR0043453 -> Acanthopagrus_sp_KPM-NR0043453
Cleaning leaf name: Mola_cf._ramsayi_SAK-2005 -> Mola_cf_ramsayi_SAK-2005
Cleaning leaf name: Chaunax_sp._VAD-2015 -> 

In [14]:
# Lookups. Try to look up each genus in the taxonomy database.
import re

leaves_sorted_by_order = {}
for leaf in species_tree.get_leaves():
    # Grab the first part of the name. It can be the first word separated by a space,
    # underscore, or dash.
    genus_name = re.split(r'[ _-]', leaf.name)[0]

    # Look up the order and family from the genus name.
    order_name, family = db.get_species_info(genus_name)

    # Add the order and family to each leaf.
    leaf.add_feature('order', order_name)
    leaf.add_feature('family', family)

    # Group the leaves by order, reporting each fish order the first time it appears.
    if order_name not in leaves_sorted_by_order:
        leaves_sorted_by_order[order_name] = []
        print(f"Adding new order: {order_name}")
    leaves_sorted_by_order[order_name].append(leaf)


Cache hit for genus 'Lepisosteus': Order: Lepisosteiformes, Family: Lepisosteidae
Adding new order: Lepisosteiformes
Cache hit for genus 'Lepisosteus': Order: Lepisosteiformes, Family: Lepisosteidae
Cache hit for genus 'Lepisosteus': Order: Lepisosteiformes, Family: Lepisosteidae
Cache hit for genus 'Lepisosteus': Order: Lepisosteiformes, Family: Lepisosteidae
Cache hit for genus 'Atractosteus': Order: Lepisosteiformes, Family: Lepisosteidae
Cache hit for genus 'Atractosteus': Order: Lepisosteiformes, Family: Lepisosteidae
Cache hit for genus 'Atractosteus': Order: Lepisosteiformes, Family: Lepisosteidae
Cache hit for genus 'Amia': Order: Amiiformes, Family: Amiidae
Adding new order: Amiiformes
Cache hit for genus 'Lepidogalaxias': Order: Osmeriformes, Family: Lepidogalaxiidae
Adding new order: Osmeriformes
Cache hit for genus 'Microstoma': Order: Osmeriformes, Family: Microstomatidae
Cache hit for genus 'Bathylagus': Order: Osmeriformes, Family: Bathylagidae
Cache hit for genus 'Pseud

In [15]:
# Let's make a set of all the orders we found. For each leaf, get the order feature
# and add it to the set

found_orders = set()
for leaf in species_tree.get_leaves():
    found_orders.add(leaf.order)
print(f"Found {len(found_orders)} unique orders in the tree: {found_orders}")
for order in sorted(found_orders):
    print(f"- {order}")

Found 51 unique orders in the tree: {'Percopsiformes', 'Stomiiformes', 'Acanthuriformes', 'Stephanoberyciformes', 'Gadiformes', 'Gonorynchiformes', 'Argentiniformes', 'Beryciformes', 'Characiformes', 'Gasterosteiformes', 'Esociformes', 'Clupeiformes', 'Amiiformes', 'Scorpaeniformes', 'Gobiesociformes', 'Elopiformes', 'Tetraodontiformes', 'Beloniformes', 'Osmeriformes', 'Cyprinodontiformes', 'Saccopharyngiformes', 'Lepisosteiformes', 'Zeiformes', 'Cetomimiformes', 'Cichliformes', 'Polymixiiformes', 'Notacanthiformes', 'Perciformes', 'Albuliformes', 'Atheriniformes', 'Ateleopodiformes', 'Pleuronectiformes', 'Syngnathiformes', 'Lophiiformes', 'Acipenseriformes', 'Mugiliformes', 'Ophidiiformes', 'Polypteriformes', 'Centrarchiformes', 'Cypriniformes', 'Siluriformes', 'Batrachoidiformes', 'Gymnotiformes', 'Osteoglossiformes', 'Salmoniformes', 'Spariformes', 'Synbranchiformes', 'Lampriformes', 'Anguilliformes', 'Myctophiformes', 'Aulopiformes'}
- Acanthuriformes
- Acipenseriformes
- Albulifor

In [16]:
# Spot check for leaves assigned to orders that are not fish orders (e.g., Brassicales,
# or Sulfolobales as tested here). Any hits are fixed in the database cleanup section above.
for leaf in species_tree.get_leaves():
    if leaf.order == "Sulfolobales":
        print(f"Leaf in order Sulfolobales: {leaf.name}")

## Monophyletic groups

Every leaf has an order and family added. Let's find the order-level nodes in the tree and name those.
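
Naming an order's most recent common ancestor (MRCA) is only meaningful if the order is monophyletic, that is, if every tip below the MRCA belongs to that order. The sketch below illustrates that test on a toy tree stored as a child-to-parent map. The tree, the tip names, and the helper functions are all made up for illustration and are not part of the notebook; ETE's `check_monophyly` method performs a comparable test on real trees.

```python
# Toy tree as a child -> parent map; tips and orders are made up.
parent = {
    "a1": "n1", "a2": "n1",      # both tips of order A share the node n1
    "b1": "n2", "n1": "n2",      # b1 is sister to the A clade
    "b2": "root", "n2": "root",  # b2 sits outside n2, so order B is split
}
leaf_order = {"a1": "A", "a2": "A", "b1": "B", "b2": "B"}

def path_to_root(node):
    """List the node followed by all of its ancestors up to the root."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path

def mrca(leaves):
    """Return the deepest node shared by the root paths of all leaves."""
    leaves = list(leaves)
    if len(leaves) == 1:
        return leaves[0]  # a single tip is its own MRCA
    common = set(path_to_root(leaves[0]))
    for leaf in leaves[1:]:
        common &= set(path_to_root(leaf))
    # Walking up from any leaf, the first shared node is the most recent one.
    return next(n for n in path_to_root(leaves[0]) if n in common)

def is_monophyletic(order):
    """True if the tips below the order's MRCA are exactly the order's tips."""
    members = {leaf for leaf, o in leaf_order.items() if o == order}
    ancestor = mrca(members)
    descendants = {leaf for leaf in leaf_order if ancestor in path_to_root(leaf)}
    return descendants == members

for order in ("A", "B"):
    print(order, is_monophyletic(order))
```

In this toy tree, order A is monophyletic and order B is not, because the MRCA of B is the root and also contains the tips of A. A single-tip order is treated as its own MRCA here, a case worth handling explicitly before renaming any node.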

In [17]:
# For each fish order, find the corresponding leaves in the species tree.
for order_name in found_orders:
    leaves_in_given_order = species_tree.search_nodes(order=order_name)
    print(f"Order {order_name} has {len(leaves_in_given_order)} leaves in the species tree.")

    # Now get the most recent common ancestor for these leaves.
    if leaves_in_given_order:
        mrca = species_tree.get_common_ancestor(leaves_in_given_order)
        print(f"The most recent common ancestor of order {order_name} is {mrca.name}.")
        mrca.name = order_name
        print(f"Renamed MRCA node to {mrca.name}.")

Order Percopsiformes has 12 leaves in the species tree.
The most recent common ancestor of order Percopsiformes is 18528.
Renamed MRCA node to Percopsiformes.
Order Stomiiformes has 103 leaves in the species tree.
The most recent common ancestor of order Stomiiformes is 19206.
Renamed MRCA node to Stomiiformes.
Order Acanthuriformes has 7 leaves in the species tree.
The most recent common ancestor of order Acanthuriformes is 2839.
Renamed MRCA node to Acanthuriformes.
Order Stephanoberyciformes has 15 leaves in the species tree.
The most recent common ancestor of order Stephanoberyciformes is 17963.
Renamed MRCA node to Stephanoberyciformes.
Order Gadiformes has 197 leaves in the species tree.
The most recent common ancestor of order Gadiformes is 18462.
Renamed MRCA node to Gadiformes.
Order Gonorynchiformes has 14 leaves in the species tree.
The most recent common ancestor of order Gonorynchiformes is 29413.
Renamed MRCA node to Gonorynchiformes.
Order Argentiniformes has 8 leaves in

In [18]:
# Finally, let's save the cleaned species tree to a file.
species_tree.write(format=1, outfile='output/Actinopterygii_species_with_order.nwk')

## Tree of orders

We do need a tree that shows the relationships between the different orders. This is used in subsequent notebooks to build a scaffold tree that will have the points grafted onto it. See the insect dataset for more info on this.

<b>IMPORTANT</b>

You *can* generate an order-level tree directly from the timetree.org site. However, its orders are usually paraphyletic: certain orders show up multiple times in the tree, and fixing that requires extensive cleaning (and knowledge of fish phylogenetics, which I do not have). So we will use the species-level tree from TimeTree to generate monophyletic order-level groupings, and then generate an order-level tree from there.
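
The idea of reducing each order to a single representative tip can be sketched without a tree. This is a minimal illustration under stated assumptions: the `tip_order` mapping and the `pick_representatives` helper are hypothetical, and the fixed `seed` is an addition for repeatability that the notebook itself does not use. When every order is monophyletic, any tip of the order should lead to the same order-level branching pattern after pruning, so the random choice matters less than it may seem.

```python
import random

# Hypothetical tip -> order assignments standing in for the annotated leaves.
tip_order = {
    "Zeus_faber": "Zeiformes",
    "Mora_moro": "Gadiformes",
    "Gadus_morhua": "Gadiformes",
    "Amia_calva": "Amiiformes",
}

def pick_representatives(tip_order, seed=0):
    """Choose one tip per order; a fixed seed makes the choice repeatable."""
    rng = random.Random(seed)
    by_order = {}
    # Sort the tips so the grouping does not depend on dictionary order.
    for tip in sorted(tip_order):
        by_order.setdefault(tip_order[tip], []).append(tip)
    return {order: rng.choice(tips) for order, tips in sorted(by_order.items())}

representatives = pick_representatives(tip_order)
print(representatives)
```

The tips chosen this way would then be the ones kept when pruning, with each surviving tip renamed to its order.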

In [19]:
# For each order, pick one leaf at random to represent that order.
import random

# Let's start with a full copy of the original tree that we can prune.
fish_tree_order_level = species_tree.copy()

fishes_to_keep = []
for order in found_orders:
    leaves_in_order = fish_tree_order_level.search_nodes(order=order)
    if leaves_in_order:
        random_leaf = random.choice(leaves_in_order)
        fishes_to_keep.append(random_leaf)

# Now prune the tree to keep only these leaves.
fish_tree_order_level.prune(fishes_to_keep, preserve_branch_length=True)

# Now change the tip names to just the order name.
for leaf in fish_tree_order_level.get_leaves():
    leaf.name = leaf.order

# Now make sure any internal nodes that have the same name as a leaf are renamed to something else.
for node in fish_tree_order_level.traverse():
    if node.is_leaf():
        continue
    if node.name in found_orders:
        node.name = f"{node.name}_internal"

print(f'Order-level tree has {len(fish_tree_order_level.get_leaves())} tips.')
print(fish_tree_order_level)

Order-level tree has 51 tips.

            /-Amiiformes
         /-|
        |   \-Lepisosteiformes
        |
        |                        /-Siluriformes
        |                     /-|
        |                  /-|   \-Characiformes
        |                 |  |
        |               /-|   \-Gymnotiformes
        |              |  |
        |            /-|   \-Cypriniformes
        |           |  |
        |         /-|   \-Gonorynchiformes
        |        |  |
        |        |   \-Clupeiformes
        |        |
        |        |         /-Esociformes
        |        |      /-|
        |        |   /-|   \-Salmoniformes
        |        |  |  |
        |        |  |   \-Argentiniformes
        |        |  |
        |        |  |                                             /-Spariformes
        |        |  |                                          /-|
        |        |  |                                         |  |   /-Tetraodontiformes
        |        |  |        

In [20]:
# Looks good, let's save the order-level tree to a file.
fish_tree_order_level.write(format=1, outfile="output/Actinopterygii_order_level.nwk")
# Let's load it back in to make sure it saved correctly.
order_level_tree = Tree("output/Actinopterygii_order_level.nwk", format=1, quoted_node_names=True)
print(order_level_tree)



            /-Amiiformes
         /-|
        |   \-Lepisosteiformes
        |
        |                        /-Siluriformes
        |                     /-|
        |                  /-|   \-Characiformes
        |                 |  |
        |               /-|   \-Gymnotiformes
        |              |  |
        |            /-|   \-Cypriniformes
        |           |  |
        |         /-|   \-Gonorynchiformes
        |        |  |
        |        |   \-Clupeiformes
        |        |
        |        |         /-Esociformes
        |        |      /-|
        |        |   /-|   \-Salmoniformes
        |        |  |  |
        |        |  |   \-Argentiniformes
        |        |  |
        |        |  |                                             /-Spariformes
        |        |  |                                          /-|
        |        |  |                                         |  |   /-Tetraodontiformes
        |        |  |                                      

Let's make a CSV file that has the following columns for use in later notebooks:

- genus
- order
- family
- taxon
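
The row layout can be sketched with the standard library alone. This is a small illustration, not the notebook's implementation: the `taxon_rows` helper and the `lookup` dictionary are hypothetical stand-ins for the annotated tree and the taxonomy database, and the output is written to an in-memory buffer rather than a file.

```python
import csv
import io
import re

def taxon_rows(leaf_names, lookup):
    """Yield one row per tip: genus, order, family, and the full tip name."""
    for name in leaf_names:
        genus = re.split(r"[ _-]", name)[0]
        order, family = lookup.get(genus, (None, None))
        yield {"genus": genus, "order": order, "family": family, "taxon": name}

# Made-up lookup table standing in for the taxonomy database.
lookup = {"Zeus": ("Zeiformes", "Zeidae"), "Mora": ("Gadiformes", "Moridae")}

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["genus", "order", "family", "taxon"])
writer.writeheader()
writer.writerows(taxon_rows(["Zeus_faber", "Mora_moro"], lookup))
print(buffer.getvalue())
```

Collecting all rows first and writing them in one call avoids growing a table one row at a time, which matters with roughly 15,000 tips.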


In [21]:

import re

import pandas as pd

# Collect the rows in a list and build the DataFrame once; appending to a
# DataFrame one row at a time is slow for a tree of this size.
rows = []
for leaf in species_tree.get_leaves():
    genus_name = re.split(r'[ _-]', leaf.name)[0]

    # The order and family were attached to each leaf during the lookup step.
    rows.append([genus_name, leaf.order, leaf.family, leaf.name])

genus_order_family_df = pd.DataFrame(rows, columns=['genus', 'order', 'family', 'taxon'])

# Save this list to a file.
genus_order_family_df.to_csv('output/Actinopterygii_genus_order_family_taxon.csv', index=False)

## Tree of families

Let's do the same thing for families. We'll use this to build a scaffold, similar to the order-level tree, but using monophyletic families.


In [None]:
# First let's get a list of all the unique families in the tree.
unique_families = set()
for leaf in species_tree.get_leaves():
    unique_families.add(leaf.family)
print(f"Found {len(unique_families)} unique families in the tree.")
for family in sorted(unique_families):
    print(f"- {family}")

## Family-level statistics

The cells below draw bar plots of the number of leaves per order and the number of unique families per order. These are not needed for making OpenSpace assets.

In [None]:
# Let's show a barplot of the number of leaves per order. Add the number of leaves on top of each bar.
import matplotlib.pyplot as plt
order_counts = {order: len([leaf for leaf in species_tree.get_leaves() if leaf.order == order]) for order in found_orders}
# Put these in alphabetical order.
#order_counts = dict(sorted(order_counts.items()))
plt.figure(figsize=(10, 6))
bars = plt.bar(order_counts.keys(), order_counts.values())
plt.xticks(rotation=90)
plt.xlabel('Order')
plt.ylabel('Number of Leaves')

# Add the number of leaves on top of each bar.
for bar in bars:
    yval = bar.get_height()
    #plt.text(bar.get_x() + bar.get_width()/2, yval, int(yval), va='bottom')  # va: vertical alignment
    plt.text(bar.get_x(), yval, int(yval), va='bottom')  # va: vertical alignment
plt.title('Number of Leaves per Order in Species Tree')
plt.tight_layout()
plt.show()

In [None]:

# How many unique families in each order? Let's make a barplot.
unique_families_per_order = genus_order_family_df.groupby('order')['family'].nunique()

# Reorder the counts to match the iteration order of the found_orders set, which is
# the order used in the previous plot. Convert the set to a list for reindexing.
unique_families_per_order = unique_families_per_order.reindex(list(found_orders))

import matplotlib.pyplot as plt

plt.figure(figsize=(10, 6))
plt.bar(unique_families_per_order.index, unique_families_per_order.values)
plt.xticks(rotation=90)
plt.title('Number of Unique Families per Order')
plt.xlabel('Order')
plt.ylabel('Number of Unique Families')
plt.tight_layout()
plt.show()
