# Ray-finned Fish species tree statistics and scaffold

Dimensionality reduction, clustering, and similar methods give very messy results on the
full fish dataset. It does not appear to share the insect dataset's problem (very
early-diverging lineages, such as Archaeognatha), but it is still a bit of a mess, as expected.

Here we sub-sample order by order to build an order-level scaffold. We will then run MDS
(or another embedding method) in 2D on each group and graft those embeddings onto the
order-level scaffold tree.
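
The per-group 2D embedding step can be sketched as follows. This is only an illustration under stated assumptions: the cells below use scikit-learn's iterative `MDS`, whereas this sketch swaps in classical (Torgerson) MDS so that it needs nothing beyond NumPy. The function name `classical_mds` and the toy distance matrix are hypothetical and not part of this notebook.

```python
import numpy as np

def classical_mds(distances, n_components=2):
    """Embed points from a symmetric distance matrix using classical (Torgerson) MDS."""
    d = np.asarray(distances, dtype=float)
    n = d.shape[0]
    # Double-centre the squared distances to obtain a Gram (inner product) matrix.
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ (d ** 2) @ centering
    # eigh returns eigenvalues in ascending order, so pick the largest ones explicitly.
    eigvals, eigvecs = np.linalg.eigh(gram)
    top = np.argsort(eigvals)[::-1][:n_components]
    # Clip tiny negative eigenvalues that come from round-off or non-Euclidean distances.
    scale = np.sqrt(np.clip(eigvals[top], 0.0, None))
    return eigvecs[:, top] * scale

# Toy distance matrix (made up): three taxa lying on a line at positions 0, 1 and 3.
toy_distances = np.array([[0.0, 1.0, 3.0],
                          [1.0, 0.0, 2.0],
                          [3.0, 2.0, 0.0]])
toy_coords = classical_mds(toy_distances, n_components=2)
print(toy_coords.shape)
```

For one order, the input would be the sub-matrix of `species_distance_matrix` restricted to that order's taxa. Tree distances are generally not Euclidean, so some eigenvalues can be negative and the 2D layout is an approximation.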

In [1]:
import ete3
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
from sklearn.manifold import MDS

In [2]:
# Distance matrix. This is in a separate cell because it takes a little while to
# load and doesn't need to be reloaded every time. (Doesn't change.)
species_distance_matrix = pd.read_csv("output/Actinopterygii_tree_distance_matrix_py.csv", index_col=0)


In [3]:

# Taxonomy info.
genus_order_family_taxon_lookup = pd.read_csv("output/Actinopterygii_genus_order_family_taxon.csv")

# The tree. This has the species as leaves and orders are named internal nodes.
species_tree_with_orders = ete3.Tree("output/Actinopterygii_species_with_order.nwk", format=1)

In [4]:
# Let's verify a few things. 
#  The genus lookup table should match the tree. How many in each?

num_leaves = len(species_tree_with_orders.get_leaves())
print(f"{num_leaves} leaves in the genus tree.")

taxa_in_lookup_table = genus_order_family_taxon_lookup.shape[0]
print(f"{taxa_in_lookup_table} taxa in the lookup table.")

15173 leaves in the genus tree.
15173 taxa in the lookup table.
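
Matching counts do not prove that the two sources contain the same names. A minimal sketch of a stricter check, assuming the tree's leaf names and the table's `taxon` column use the same naming scheme (not verified here); `compare_name_sets` and the toy names are hypothetical:

```python
def compare_name_sets(tree_names, table_names):
    """Return (only_in_tree, only_in_table) as sorted lists of names."""
    tree_set = set(tree_names)
    table_set = set(table_names)
    return sorted(tree_set - table_set), sorted(table_set - tree_set)

# Toy data standing in for the real leaf names and the 'taxon' column.
toy_tree_leaves = ["Amia_calva", "Esox_lucius", "Salmo_salar"]
toy_table_taxa = ["Amia_calva", "Esox_lucius", "Salmo_trutta"]

only_in_tree, only_in_table = compare_name_sets(toy_tree_leaves, toy_table_taxa)
print("only in tree: ", only_in_tree)
print("only in table:", only_in_table)
```

With the notebook's objects, the inputs would be `species_tree_with_orders.get_leaf_names()` and `genus_order_family_taxon_lookup["taxon"]`. Two empty lists mean the names agree exactly.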


In [5]:
# Make a list of the fish orders. Use the taxa lookup table to get the unique orders.
fish_orders = genus_order_family_taxon_lookup["order"].unique().tolist()
print(f"{len(fish_orders)} unique fish orders in the lookup table.")

51 unique fish orders in the lookup table.


## Some basic counts

How many species and families? Which families have the most genera? Which orders have the most families, etc?

In [6]:
# Let's first make a list of all the orders.
all_orders = genus_order_family_taxon_lookup['order'].unique()

# Now for each order, build a dictionary entry: the key is the order and the value is the set of all families in that order.
# We use a set rather than a list because each family repeats once per taxon.
order_to_families = {}

# Walk the lookup table (a DataFrame) row by row and file each family under its order.
for order, family in zip(genus_order_family_taxon_lookup['order'], genus_order_family_taxon_lookup['family']):
    order_to_families.setdefault(order, set()).add(family)

# Let's print it.
for order, families in order_to_families.items():
    print(f"{order:<20}  {len(families):<5}")


Lepisosteiformes      1    
Amiiformes            1    
Osmeriformes          11   
Argentiniformes       2    
Esociformes           2    
Salmoniformes         1    
Lampriformes          6    
Ophidiiformes         6    
Perciformes           163  
Acanthuriformes       2    
Spariformes           1    
Tetraodontiformes     10   
Lophiiformes          16   
Centrarchiformes      1    
Scorpaeniformes       32   
Gasterosteiformes     5    
Synbranchiformes      3    
Pleuronectiformes     11   
Cichliformes          1    
Atheriniformes        9    
Beloniformes          6    
Cyprinodontiformes    11   
Gobiesociformes       1    
Mugiliformes          1    
Syngnathiformes       5    
Batrachoidiformes     1    
Beryciformes          6    
Cetomimiformes        3    
Stephanoberyciformes  2    
Gadiformes            10   
Zeiformes             6    
Percopsiformes        3    
Polymixiiformes       1    
Myctophiformes        2    
Aulopiformes          15   
Ateleopodiformes    
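
To see which orders have the most families, the counts can be sorted. A minimal sketch using only the standard library; `rank_orders_by_family_count` and the toy pairs are hypothetical and the numbers are not results from this dataset:

```python
from collections import defaultdict

def rank_orders_by_family_count(order_family_pairs):
    """Return (order, n_families) tuples, largest count first, ties broken by name."""
    order_to_families = defaultdict(set)
    for order, family in order_family_pairs:
        order_to_families[order].add(family)
    counts = [(order, len(families)) for order, families in order_to_families.items()]
    return sorted(counts, key=lambda item: (-item[1], item[0]))

# Toy (order, family) pairs; repeated pairs are collapsed by the set.
toy_pairs = [
    ("Esociformes", "Esocidae"),
    ("Esociformes", "Umbridae"),
    ("Esociformes", "Esocidae"),
    ("Amiiformes", "Amiidae"),
]
for order, n_families in rank_orders_by_family_count(toy_pairs):
    print(f"{order:<20}  {n_families:<5}")
```

With the lookup table, the pairs would come from `zip(genus_order_family_taxon_lookup["order"], genus_order_family_taxon_lookup["family"])`.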

## Random sampling

Let's grab one random taxon from every order and build a distance matrix from that sample. Then we'll run MDS on it in 3D and see what it looks like with the tree overlaid, before projecting the points onto a sphere.

In [7]:
# Use the genus lookup table to grab one taxon from each order at random.
# Note: np.random.choice is unseeded here, so every run picks a different sample.
random_taxa = []
for order in all_orders:
    species_in_order = genus_order_family_taxon_lookup[genus_order_family_taxon_lookup['order'] == order]['taxon'].values
    if len(species_in_order) > 0:
        random_taxon = np.random.choice(species_in_order)
        random_taxa.append(random_taxon)
        
random_taxa_distance_matrix = species_distance_matrix.loc[random_taxa, random_taxa]

# Metric MDS in 3D on the precomputed tree distances.
mds = MDS(n_components=3, max_iter=4000, eps=1e-6, dissimilarity='precomputed', n_jobs=-1, verbose=10)
# fit() returns the fitted estimator itself; the coordinates live in its embedding_ attribute.
random_taxa_mds_coords = mds.fit(random_taxa_distance_matrix)
random_taxa_mds_df = pd.DataFrame(random_taxa_mds_coords.embedding_, index=random_taxa, columns=['x', 'y', 'z'])
random_taxa_mds_df.index.name = 'taxon'


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 24 concurrent workers.
[Parallel(n_jobs=-1)]: Done   2 out of   4 | elapsed:    1.7s remaining:    1.7s
[Parallel(n_jobs=-1)]: Done   4 out of   4 | elapsed:    1.7s finished
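
Because the sampling above is unseeded, the embedding changes from run to run. A minimal sketch of a reproducible variant, written with the standard library's `random` module rather than NumPy; `sample_one_per_group`, the seed value, and the toy mapping are hypothetical:

```python
import random

def sample_one_per_group(taxon_to_group, seed):
    """Pick one taxon per group, reproducibly for a given seed."""
    rng = random.Random(seed)
    groups = {}
    for taxon, group in taxon_to_group.items():
        groups.setdefault(group, []).append(taxon)
    # Sort groups and members so the result does not depend on input ordering.
    return {group: rng.choice(sorted(members)) for group, members in sorted(groups.items())}

# Toy taxon -> order mapping (illustrative names only).
toy_lookup = {
    "Esox_lucius": "Esociformes",
    "Umbra_limi": "Esociformes",
    "Amia_calva": "Amiiformes",
    "Salmo_salar": "Salmoniformes",
    "Salmo_trutta": "Salmoniformes",
}
picked = sample_one_per_group(toy_lookup, seed=0)
print(picked)
```

In the notebook itself, a comparable effect could be had by creating `rng = np.random.default_rng(seed)` once and calling `rng.choice(species_in_order)` in place of `np.random.choice`.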


In [8]:
# Add a column for the order.
random_taxa_mds_df['order'] = random_taxa_mds_df.index.map(lambda g: genus_order_family_taxon_lookup[genus_order_family_taxon_lookup['taxon'] == g]['order'].values[0])

# Reorder the columns so order is first.
random_taxa_mds_df = random_taxa_mds_df[['order', 'x', 'y', 'z']]

# Save it out as a CSV file so we can run Wandrille's script on it to put in branches.
random_taxa_mds_df.to_csv("output/random_taxa_mds_coords.csv", index=False)

%run ./integrate_tree_to_XYZ/integrate_tree_to_XYZ.py -i output/random_taxa_mds_coords.csv -t "output/Actinopterygii_order_level.nwk" -o "output/random_taxa_mds" --ignore-missing --use-z-from-file

random_taxa_mds_branches = pd.read_csv("output/random_taxa_mds.branches.csv")
Xb = []
Yb = []
Zb = []

for i,row in random_taxa_mds_branches.iterrows():
    Xb += [ row.x0 , row.x1 , None ]
    Yb += [ row.y0 , row.y1 , None ]
    Zb += [ row.z0 , row.z1 , None ]

# Plot it and use the order as the color and label.
fig = px.scatter_3d(random_taxa_mds_df, x='x', y='y', z='z', color='order', text='order')
fig.add_trace(go.Scatter3d(x=Xb, y=Yb, z=Zb, mode='lines'))
fig.update_layout(height=800, width=800)
# Make the background planes invisible.
fig.update_scenes(xaxis_visible=False, yaxis_visible=False, zaxis_visible=False)
fig.update_traces(marker=dict(size=5), textposition='top center')

writing leaves coordinates in output/random_taxa_mds.leaves.csv
writing internal nodes coordinates in output/random_taxa_mds.internal.csv
writing branch coordinates in output/random_taxa_mds.branches.csv
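
The `Xb`, `Yb`, `Zb` lists in the cell above pack every branch into a single line trace: Plotly breaks a line wherever a coordinate is `None`, so each branch becomes its own segment. A minimal sketch of that packing step on toy data; `branches_to_line_coords` is a hypothetical helper, not part of the branch script:

```python
def branches_to_line_coords(branches):
    """Flatten ((x0, y0, z0), (x1, y1, z1)) branches into None-separated coordinate lists."""
    xs, ys, zs = [], [], []
    for (x0, y0, z0), (x1, y1, z1) in branches:
        # The trailing None ends the current segment so branches are not joined together.
        xs += [x0, x1, None]
        ys += [y0, y1, None]
        zs += [z0, z1, None]
    return xs, ys, zs

# Two toy branches sharing an endpoint.
toy_branches = [((0.0, 0.0, 0.0), (1.0, 0.0, 0.0)),
                ((1.0, 0.0, 0.0), (1.0, 2.0, 0.5))]
xs, ys, zs = branches_to_line_coords(toy_branches)
print(xs)
print(ys)
print(zs)
```

The three lists can be passed directly as `x`, `y`, `z` to `go.Scatter3d(..., mode='lines')`, as in the cell above.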


## Sphere-izing the points

In the end, these points should all be on the surface of a sphere with radius 1.0.
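
Projecting onto the unit sphere means dividing each point by its distance from the origin, so `(x, y, z)` becomes `(x/r, y/r, z/r)` with `r = sqrt(x^2 + y^2 + z^2)`. A minimal sketch on toy points, using only the standard library; `project_to_unit_sphere` is a hypothetical helper, and the zero-radius guard is an addition that the notebook cell does not have:

```python
import math

def project_to_unit_sphere(points):
    """Scale each (x, y, z) point radially so that it lies at distance 1.0 from the origin."""
    projected = []
    for x, y, z in points:
        r = math.sqrt(x * x + y * y + z * z)
        if r == 0.0:
            # The origin has no direction, so it cannot be placed on the sphere.
            raise ValueError("cannot project the origin onto a sphere")
        projected.append((x / r, y / r, z / r))
    return projected

# Toy points (made up); the first has radius 5, the second radius 2.
toy_points = [(3.0, 0.0, 4.0), (0.0, -2.0, 0.0)]
for point in project_to_unit_sphere(toy_points):
    print(point)
```

This keeps only each point's direction from the origin: two points at different radii along the same direction land on the same spot on the sphere.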

In [None]:
# Sphere-ize the points. We want them all to be on the surface of a sphere with radius 1.0.
radii = np.sqrt(random_taxa_mds_df['x']**2 + random_taxa_mds_df['y']**2 + random_taxa_mds_df['z']**2)
random_taxa_mds_df['x'] = random_taxa_mds_df['x'] / radii
random_taxa_mds_df['y'] = random_taxa_mds_df['y'] / radii
random_taxa_mds_df['z'] = random_taxa_mds_df['z'] / radii

# Save it out as a CSV file so we can run Wandrille's script on it to put in branches.
random_taxa_mds_df.to_csv("output/random_taxa_mds_coords_norm_on_sphere.csv", index=False)

%run ./integrate_tree_to_XYZ/integrate_tree_to_XYZ.py -i output/random_taxa_mds_coords_norm_on_sphere.csv -t "output/Actinopterygii_order_level.nwk" -o "output/random_taxa_mds_norm_on_sphere" --ignore-missing --use-z-from-file

random_taxa_mds_branches = pd.read_csv("output/random_taxa_mds_norm_on_sphere.branches.csv")
Xb = []
Yb = []
Zb = []

for i,row in random_taxa_mds_branches.iterrows():
    Xb += [ row.x0 , row.x1 , None ]
    Yb += [ row.y0 , row.y1 , None ]
    Zb += [ row.z0 , row.z1 , None ]

# Plot it and use the order as the color and label.
fig = px.scatter_3d(random_taxa_mds_df, x='x', y='y', z='z', color='order', text='order')
fig.add_trace(go.Scatter3d(x=Xb, y=Yb, z=Zb, mode='lines'))
fig.update_layout(height=800, width=800)
# Make the background planes invisible.
fig.update_scenes(xaxis_visible=False, yaxis_visible=False, zaxis_visible=False)
fig.update_traces(marker=dict(size=5), textposition='top center')

writing leaves coordinates in output/random_taxa_mds_norm_on_sphere.leaves.csv
writing internal nodes coordinates in output/random_taxa_mds_norm_on_sphere.internal.csv
writing branch coordinates in output/random_taxa_mds_norm_on_sphere.branches.csv
