# Actinopterygii order-level t-SNE

This notebook performs t-SNE on an order-level basis to construct point locations. Each
group of points is then grafted onto the scaffold constructed in the previous notebook,
`130 Fish species tree scaffold.ipynb`.

The first iteration tried to use the tree to build a monophyletic grouping for each order
by finding the MRCA (most recent common ancestor) of all the genera in that order. That
did not seem to work correctly, so instead we use the main distance matrix constructed in
104/105 and pull each order's taxa from it.

In [3]:
# Packages.
import pandas as pd
import numpy as np
from sklearn.manifold import TSNE
import plotly.express as px

In [4]:
# Load in the main distance matrix and the taxonomy list we made earlier.
dist_matrix = pd.read_csv("output/Actinopterygii_tree_distance_matrix_py.csv", index_col=0)
taxonomy_list = pd.read_csv("output/Actinopterygii_genus_order_family_taxon.csv", index_col=0)

In [5]:
# Change the index of the taxonomy_list to be the taxon name.
taxonomy_list.index = taxonomy_list['taxon']
taxonomy_list.index.name = 'taxon'
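
Subsetting the distance matrix by taxon label only works if every taxon in the taxonomy list is also a row and column label of the matrix, since `.loc` raises a `KeyError` when asked for labels that are missing. A minimal sketch of that check on toy data follows; the frames `toy_dist` and `toy_taxonomy` and the taxon names are made up for illustration and are not the real files.

```python
import pandas as pd

# Toy stand-ins for dist_matrix and taxonomy_list (hypothetical values).
labels = ["Fugu_a", "Fugu_b", "Mola_a"]
toy_dist = pd.DataFrame(
    [[0.0, 0.2, 0.9],
     [0.2, 0.0, 0.8],
     [0.9, 0.8, 0.0]],
    index=labels, columns=labels,
)
toy_taxonomy = pd.DataFrame(
    {"order": ["Tetraodontiformes"] * 4},
    index=pd.Index(["Fugu_a", "Fugu_b", "Mola_a", "Diodon_a"], name="taxon"),
)

# Taxa listed in the taxonomy but absent from the distance matrix.
missing = toy_taxonomy.index.difference(toy_dist.index)
print(f"{len(missing)} taxa missing from the matrix: {list(missing)}")

# Keep only the taxa that can actually be looked up, then subset.
present = toy_taxonomy.index.intersection(toy_dist.index)
subset = toy_dist.loc[present, present]
print(subset.shape)
```

If `missing` is non-empty on the real data, filtering to `present` before calling `.loc` avoids the lookup error, at the cost of silently dropping those taxa.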

In [6]:
# Make a list of all the fish orders. At some point, it would be useful to
# have a Misof-style tree of fish orders in phylogenetic order from most
# basal to most derived, but for now we'll just use the order they appear in the
# taxonomy list.
fish_orders = taxonomy_list['order'].unique()

In [7]:
# Let's make an output dir for the order-level t-SNE results.
import pathlib

tsne_by_order_output_dir = pathlib.Path('output/tsne_by_order')

# parents=True also creates 'output/' if it does not exist yet.
tsne_by_order_output_dir.mkdir(parents=True, exist_ok=True)

## t-SNE for each order

We need to run t-SNE on each order individually. First we tabulate how many taxa each order contains, then define a function that runs t-SNE on a single order, and finally loop over all the orders.

In [8]:
# Let's make a quick table with the number of taxa in each order.
order_counts = taxonomy_list['order'].value_counts().reset_index()
order_counts.columns = ['order', 'count']
order_counts = order_counts.sort_values('count', ascending=False)
print(order_counts)

                   order  count
0            Perciformes   5731
1           Siluriformes   1891
2          Cypriniformes   1878
3          Characiformes   1019
4     Cyprinodontiformes    809
5        Scorpaeniformes    547
6      Pleuronectiformes    328
7      Tetraodontiformes    267
8         Atheriniformes    250
9         Anguilliformes    228
10       Syngnathiformes    210
11            Gadiformes    197
12          Clupeiformes    185
13         Gymnotiformes    176
14          Osmeriformes    142
15          Beloniformes    131
16        Myctophiformes    130
17          Lophiiformes    113
18          Stomiiformes    103
19     Osteoglossiformes    103
20         Salmoniformes     89
21          Aulopiformes     78
22          Beryciformes     77
23          Mugiliformes     64
24         Ophidiiformes     64
25       Gobiesociformes     53
26      Synbranchiformes     49
27             Zeiformes     28
28      Acipenseriformes     26
29     Batrachoidiformes     20
30     G

In [9]:
# Tetraodontiformes (pufferfishes) sits in the middle of the size table, so we'll use it as a test.

current_order = 'Tetraodontiformes'  # Change this to process a different order

def do_tsne_for_order(current_order):

    print(f"Processing order: {current_order}")

    # Get the list of taxa in this order.
    taxa_in_order = taxonomy_list[taxonomy_list['order'] == current_order]['taxon'].tolist()
    print(f"Number of taxa in {current_order}: {len(taxa_in_order)}")

    # Now filter the distance matrix to only include those taxa.
    filtered_matrix = dist_matrix.loc[taxa_in_order, taxa_in_order]
    print(f"Filtered matrix shape for {current_order}: {filtered_matrix.shape}")

    # Now run t-SNE on this distance matrix.
    print("Running t-SNE...", end='', flush=True)
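    # scikit-learn requires perplexity < number of samples. Cap it at the default of 30
    # and scale it down for small orders; orders where it falls below 5 are skipped.
    perplexity = min(30, (len(taxa_in_order) - 1) // 3)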
    if perplexity < 5:
        print(f"Not enough genera ({len(taxa_in_order)}) for t-SNE with suitable perplexity. Skipping...")
        return

    print(f"Using perplexity: {perplexity}...", end='', flush=True)
    tsne = TSNE(n_components=2, 
                perplexity=perplexity, 
                init='random',
                metric='precomputed')
    df_tsne = tsne.fit_transform(filtered_matrix)
    print("done.")

    df_tsne = pd.DataFrame(df_tsne, index=filtered_matrix.index, columns=['x', 'y'])
    df_tsne.index.name = 'taxon'

    # Add the order and family information back in. Both dataframes are indexed by
    # taxon (taxonomy_list was re-indexed in an earlier cell), so merge on the index.
    df_tsne = df_tsne.merge(taxonomy_list[['order', 'family']], left_index=True, right_index=True)

    output_path = tsne_by_order_output_dir / f"{current_order}_2D_tSNE_sklearn.csv"

    df_tsne.to_csv(output_path)

do_tsne_for_order(current_order)

Processing order: Tetraodontiformes
Number of taxa in Tetraodontiformes: 267
Filtered matrix shape for Tetraodontiformes: (267, 267)
Running t-SNE...Using perplexity: 30...

done.
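
The perplexity rule inside `do_tsne_for_order` decides which orders get an embedding at all. As a sketch, the same arithmetic pulled out into a standalone helper (the name `choose_perplexity` and its `cap`/`floor` parameters are hypothetical, not part of the notebook) shows where the cutoff falls:

```python
def choose_perplexity(n_taxa, cap=30, floor=5):
    """Return the perplexity for an order with n_taxa taxa, or None to skip it.

    Mirrors the rule used above: min(cap, (n_taxa - 1) // 3), skipped if below floor.
    """
    perplexity = min(cap, (n_taxa - 1) // 3)
    return perplexity if perplexity >= floor else None

# Taxon counts taken from the table and log output in this notebook.
for name, n in [("Tetraodontiformes", 267), ("Lampriformes", 16),
                ("Esociformes", 12), ("Lepisosteiformes", 7)]:
    print(name, n, choose_perplexity(n))

# Smallest order size that is not skipped under this rule.
smallest = next(n for n in range(1, 100) if choose_perplexity(n) is not None)
print(smallest)
```

Under this rule any order with fewer than 16 taxa is skipped, and the cap of 30 is reached once an order has 91 or more taxa.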


## Loop over all orders

The function works on the test order, so now run it on every order.

In [10]:
for current_order in fish_orders:
    do_tsne_for_order(current_order)

Processing order: Lepisosteiformes
Number of taxa in Lepisosteiformes: 7
Filtered matrix shape for Lepisosteiformes: (7, 7)
Running t-SNE...Not enough genera (7) for t-SNE with suitable perplexity. Skipping...
Processing order: Amiiformes
Number of taxa in Amiiformes: 1
Filtered matrix shape for Amiiformes: (1, 1)
Running t-SNE...Not enough genera (1) for t-SNE with suitable perplexity. Skipping...
Processing order: Osmeriformes
Number of taxa in Osmeriformes: 142
Filtered matrix shape for Osmeriformes: (142, 142)
Running t-SNE...Using perplexity: 30...done.
Processing order: Argentiniformes
Number of taxa in Argentiniformes: 8
Filtered matrix shape for Argentiniformes: (8, 8)
Running t-SNE...Not enough genera (8) for t-SNE with suitable perplexity. Skipping...
Processing order: Esociformes
Number of taxa in Esociformes: 12
Filtered matrix shape for Esociformes: (12, 12)
Running t-SNE...Not enough genera (12) for t-SNE with suitable perplexity. Skipping...
Processing order: Salmoniform

## Plotting

What do these embeddings look like? We plot one order at a time, setting `current_order` to the order we want to inspect.

In [11]:
current_order = 'Anguilliformes'  # Change this to visualize a different order

df_tsne = pd.read_csv(tsne_by_order_output_dir / f"{current_order}_2D_tSNE_sklearn.csv", index_col=0)
# Create the scatter plot.

fig = px.scatter(df_tsne, x='x', y='y', color='family', hover_name=df_tsne.index)
fig.update_layout(title=f"2D t-SNE of {current_order} Genera", xaxis_title="t-SNE1", yaxis_title="t-SNE2")
fig.update_layout(height=800, width=800)
fig.show()

In [12]:
current_order = 'Siluriformes'  # Change this to visualize a different order
df_tsne = pd.read_csv(tsne_by_order_output_dir / f"{current_order}_2D_tSNE_sklearn.csv", index_col=0)
# Create the scatter plot.

fig = px.scatter(df_tsne, x='x', y='y', color='family', hover_name=df_tsne.index)
fig.update_layout(title=f"2D t-SNE of {current_order} Genera", xaxis_title="t-SNE1", yaxis_title="t-SNE2")
fig.update_layout(height=800, width=800)
fig.show()

## Overlapping points

For each order, let's see how many points overlap in the t-SNE plot.
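
"Overlap" can be counted in two ways: as the number of distinct locations occupied by more than one point, or as the number of points sitting at such a location. A small sketch with made-up coordinates shows the difference; the helper name `count_overlaps` is hypothetical and only illustrates the two tallies.

```python
from collections import Counter

def count_overlaps(points):
    """Return (shared_locations, points_at_shared_locations) for (x, y) tuples."""
    counts = Counter(points)
    shared = [c for c in counts.values() if c > 1]
    return len(shared), sum(shared)

# Five points: three stacked at (0.0, 1.0) plus two singletons.
toy_points = [(0.0, 1.0), (0.0, 1.0), (0.0, 1.0), (2.5, -1.0), (3.0, 4.0)]
locations, stacked = count_overlaps(toy_points)
print(locations, stacked)
```

Here one location is shared, but three points are involved, so the two tallies can differ substantially when many points stack on the same coordinates.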


In [13]:
# Load in the CSV for each one and count the number of coincident points (identical x,y coordinates).

for current_order in fish_orders:
    order_path = tsne_by_order_output_dir / f"{current_order}_2D_tSNE_sklearn.csv"
    try:
        df_tsne = pd.read_csv(order_path, index_col=0)
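        # Group rows by exact (x, y) pair. Note that num_coincident is the number of
        # distinct locations shared by two or more points, not the number of points involved.
        coord_counts = df_tsne.groupby(['x', 'y']).size()
        num_coincident = (coord_counts > 1).sum()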
        total_points = len(df_tsne)
        print(f"{current_order}: {num_coincident} coincident points out of {total_points} total points.")
    except FileNotFoundError:
        print(f"File not found for order: {current_order}")
        continue


File not found for order: Lepisosteiformes
File not found for order: Amiiformes
Osmeriformes: 21 coincident points out of 142 total points.
File not found for order: Argentiniformes
File not found for order: Esociformes
Salmoniformes: 14 coincident points out of 89 total points.
Lampriformes: 0 coincident points out of 16 total points.
Ophidiiformes: 6 coincident points out of 64 total points.
Perciformes: 1020 coincident points out of 5731 total points.
File not found for order: Acanthuriformes
File not found for order: Spariformes
Tetraodontiformes: 51 coincident points out of 267 total points.
Lophiiformes: 20 coincident points out of 113 total points.
File not found for order: Centrarchiformes
Scorpaeniformes: 72 coincident points out of 547 total points.
Gasterosteiformes: 0 coincident points out of 20 total points.
Synbranchiformes: 2 coincident points out of 49 total points.
Pleuronectiformes: 59 coincident points out of 328 total points.
File not found for order: Cichliformes
A