<a href="https://colab.research.google.com/github/pia-francesca/ema/blob/main/examples/Pla2g2/emmaemb_pla2g2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# EmmaEmb: Comparative Analysis of Embedding Spaces

Welcome to the example Colab notebook for EmmaEmb, a Python library for analyzing and comparing embedding spaces in molecular biology. EmmaEmb provides tools to explore how different embedding models capture biological information, enabling insights into feature similarities, differences, and relationships across embeddings.

Link to GitHub: https://github.com/broadinstitute/EmmaEmb


### Notebook content

This notebook demonstrates key functionalities of EmmaEmb, including:

1. [Initialising the Emma object](#section-one)

2. [Adding embedding spaces](#section-two)

3. [Feature distribution across spaces](#section-three)

4. [Pairwise space comparison](#section-four)

![EmmaEmb Overview](https://raw.githubusercontent.com/broadinstitute/EmmaEmb/main/images/emma_overview.jpg)




## 0. Loading dependencies and data

Information on the data and embedding models can be found here: https://github.com/broadinstitute/EmmaEmb/tree/main/examples/Pla2g2

In [1]:
#@title Install dependencies
%pip install emmaemb

Collecting emmaemb
  Downloading emmaemb-1.0.3-py3-none-any.whl.metadata (9.8 kB)
Collecting plotly-express>=0.4.1 (from emmaemb)
  Downloading plotly_express-0.4.1-py2.py3-none-any.whl.metadata (1.7 kB)
Collecting plotly>=6.0.0 (from emmaemb)
  Downloading plotly-6.0.0-py3-none-any.whl.metadata (5.6 kB)
Collecting pynndescent>=0.5.13 (from emmaemb)
  Downloading pynndescent-0.5.13-py3-none-any.whl.metadata (6.8 kB)
Collecting python-dateutil>=2.9.0.post0 (from emmaemb)
  Downloading python_dateutil-2.9.0.post0-py2.py3-none-any.whl.metadata (8.4 kB)
Collecting umap-learn>=0.5.7 (from emmaemb)
  Downloading umap_learn-0.5.7-py3-none-any.whl.metadata (21 kB)
Downloading emmaemb-1.0.3-py3-none-any.whl (16 kB)
Downloading plotly-6.0.0-py3-none-any.whl (14.8 MB)
Downloading plotly_express-0.4.1-py2.py3-none-any.whl (2.9 kB)
Downloading pynndescent-0.5.13-py3-none-any.

In [2]:
#@title Download example data from EmmaEmb repository

import requests
import pandas as pd
import os

# download embeddings

models = ["ESMC", "ProtT5"]
embedding_url_dir = "https://raw.githubusercontent.com/broadinstitute/EmmaEmb/main/examples/Pla2g2/embeddings/"

headers = {"User-Agent": "Mozilla/5.0"}
csv_url = "https://raw.githubusercontent.com/broadinstitute/EmmaEmb/main/examples/Pla2g2/Pla2g2_features.csv"

csv_filename = "Pla2g2_features.csv"
csv_response = requests.get(csv_url, headers=headers)
if csv_response.status_code == 200:
    with open(csv_filename, "wb") as f:
        f.write(csv_response.content)
else:
    print(f"Failed to download {csv_filename}")


df_pla2g2 = pd.read_csv(csv_filename)
proteins = df_pla2g2['identifier'].values

# now for each model download embedding files for each protein
for model in models:
  model_dir = f"embeddings/{model}"
  os.makedirs(model_dir, exist_ok=True)

  for protein in proteins:
    file_path = os.path.join(model_dir, f"{protein}.npy")

    # Check if file already exists
    if os.path.exists(file_path):
        continue

    url = f"{embedding_url_dir}{model}/{protein}.npy"
    response = requests.get(url, headers=headers)

    if response.status_code == 200:
      # store in embeddings/model/protein-id.npy
      with open(file_path, "wb") as f:
        f.write(response.content)
    else:
      print(f"Failed to download {url}")


print("All download of feature data complete.")

All download of feature data complete.


<a name="section-one"></a>
## 1. Initialising the Emma object

In [3]:
import pandas as pd
from emmaemb import Emma

### Feature data table

Load the feature data. The first column contains the sample identifiers, in this case proteins. The remaining columns contain metadata on each sample.

In [4]:
df_pla2f2 = pd.read_csv("Pla2g2_features.csv")
print(df_pla2f2.shape)
df_pla2f2.head()

(446, 7)


| | identifier | gene | group | enzyme_class | species | seq_length | length_bin |
|---|---|---|---|---|---|---|---|
| 0 | mAimeC | Pla2g2C | C mammals | C | mammals | 132 | 130-135 |
| 1 | mAimeF | Pla2g2F | F mammals | F | mammals | 136 | 135-145 |
| 2 | mAimeDb | Pla2g2D1 | D mammals | D | mammals | 124 | 120-130 |
| 3 | mAimeDa | Pla2g2D1 | D mammals | D | mammals | 124 | 120-130 |
| 4 | mAimeE | Pla2g2E | E mammals | E | mammals | 123 | 120-130 |


The Emma object is initialised with the feature data as a pandas DataFrame. The data type of each column is detected automatically. Only categorical columns are available for downstream analysis with the Emma library.

Note: Quantitative features can be binned to allow analysis with EmmaEmb.
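As a rough illustration of such binning, the sketch below maps sequence lengths to range labels like those in the `length_bin` column. It uses only the standard library and does not call EmmaEmb. The bin edges and the half-open interval convention are assumptions chosen to match the five preview rows above, not the binning actually used to build the dataset.

```python
from bisect import bisect_right

# Hypothetical bin edges, chosen only to match the labels visible in the preview rows.
EDGES = [120, 130, 135, 145]


def bin_length(value, edges=EDGES):
    """Return a 'low-high' label for the half-open bin [low, high) containing value."""
    idx = bisect_right(edges, value)
    if idx == 0 or idx == len(edges):
        # Value lies below the first edge or at/above the last edge.
        return "out of range"
    return f"{edges[idx - 1]}-{edges[idx]}"


seq_lengths = [132, 136, 124, 123]  # seq_length values from the preview rows
length_bins = [bin_length(v) for v in seq_lengths]
print(length_bins)
```

With pandas, `pd.cut` offers the same kind of binning directly on a DataFrame column.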

In [5]:
# initiate Emma object with the metadata

emma = Emma(df_pla2f2)

446 samples loaded.
Categories in meta data: ['gene', 'group', 'enzyme_class', 'species', 'length_bin']
Numerical columns in meta data: ['seq_length']


<a name="section-two"></a>
## 2. Adding embedding spaces

Embedding spaces are added one at a time, either by

- providing the path to a directory that stores each embedding in an individual file named after the identifier in the feature table, or

- providing a numpy array that holds one embedding per row, in the same order as the feature data table.


Multiple embedding spaces can be added. Dimensions of the embeddings do not have to be the same across embedding spaces. Embedding spaces can be removed using the `remove_emb_space(emb_space_name: str)` function.

In [6]:
embedding_dir = "embeddings/"
models = ["ProtT5", "ESMC"]

In [7]:
for model_name in models:
    emma.add_emb_space(
        embeddings_source=embedding_dir + model_name,
        emb_space_name=model_name,
    )

Embedding space 'ProtT5' added successfully.
Embeddings have 1024 features each.
Embedding space 'ESMC' added successfully.
Embeddings have 960 features each.


### Visualization of embedding spaces using dimensionality reduction techniques

The `plot_emb_space` function visualizes the embeddings of a specified embedding space in 2D using dimensionality reduction techniques such as PCA, t-SNE, or UMAP. It takes an Emma instance containing multiple embedding spaces and projects the selected space into two dimensions, optionally normalizing the data beforehand. The resulting scatter plot can be colored based on metadata attributes, allowing for an intuitive exploration of patterns within the embedding space.

In [8]:
# visualise reduced embedding space
from emmaemb.vizualization import plot_emb_space

fig_pca = plot_emb_space(emma=emma,
                         emb_space="ProtT5",
                         method="PCA",
                         color_by="enzyme_class",
                         normalise=True)
fig_pca.show()

<a name="section-three"></a>
## 3. Feature distribution across spaces

The `get_pairwise_distances` function of the Emma object calculates pairwise distances between samples within a specified embedding space using a chosen distance metric. Implemented metrics include Euclidean, Manhattan (`cityblock`), cosine, and Mahalanobis distances, as well as scaled versions. To avoid redundant calculations, the function first checks whether the distances have already been computed and stored in the Emma object.
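The compute-once-then-reuse behaviour described above can be sketched in plain Python. This is a toy stand-in, not EmmaEmb's implementation: the `DistanceCache` class, its attributes, and the toy vectors are all hypothetical and exist only to show the caching idea.

```python
def cityblock(u, v):
    """Manhattan (L1) distance between two equal-length vectors."""
    return sum(abs(a - b) for a, b in zip(u, v))


def euclidean(u, v):
    """Euclidean (L2) distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5


METRICS = {"cityblock": cityblock, "euclidean": euclidean}


class DistanceCache:
    """Hypothetical container that stores one distance matrix per (space, metric)."""

    def __init__(self, spaces):
        self.spaces = spaces    # space name -> list of embedding vectors
        self._cache = {}        # (space name, metric) -> distance matrix
        self.computations = 0   # how many matrices were actually computed

    def get_pairwise_distances(self, name, metric):
        key = (name, metric)
        if key not in self._cache:
            dist = METRICS[metric]
            vectors = self.spaces[name]
            self._cache[key] = [[dist(u, v) for v in vectors] for u in vectors]
            self.computations += 1
        return self._cache[key]


toy = DistanceCache({"space_a": [[0.0, 0.0], [1.0, 2.0], [4.0, 6.0]]})
pwd_first = toy.get_pairwise_distances("space_a", "cityblock")
pwd_second = toy.get_pairwise_distances("space_a", "cityblock")  # served from the cache
print(pwd_first)
print(toy.computations)
```

The second call returns the stored matrix, so repeated plotting calls with the same space and metric do not pay for the distance computation again.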

In [9]:
pwd = emma.get_pairwise_distances("ProtT5", "cityblock")
pwd = emma.get_pairwise_distances("ESMC", "cityblock")

Calculating pairwise distances using cityblock...
Calculating pairwise distances using cityblock...


### 3.1 KNN feature alignment scores

The `plot_knn_alignment_across_embedding_spaces` function visualizes k-nearest neighbor (KNN) feature alignment scores for a specified feature across multiple embedding spaces.
For each sample, it computes the fraction of its k nearest neighbors that carry the same label for the selected feature.
The scores are calculated for each embedding in each embedding space.
Pairwise distances need to be pre-computed for each embedding space and each distance metric using the `get_pairwise_distances` function (see code block above).
The function produces a box plot and allows customization of the embedding space order and plot color.
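To make the score concrete, here is a minimal sketch of the per-sample computation on toy data. It is an illustration under stated assumptions, not EmmaEmb's code: the function name `knn_alignment_scores`, the one-dimensional toy points, and the tie-breaking by sample index are all choices made for this example.

```python
def knn_alignment_scores(distances, labels, k):
    """For each sample, the fraction of its k nearest neighbors sharing its label."""
    scores = []
    for i, row in enumerate(distances):
        # Rank all other samples by distance to sample i (ties keep index order).
        others = [j for j in range(len(row)) if j != i]
        neighbors = sorted(others, key=lambda j: row[j])[:k]
        same_label = sum(labels[j] == labels[i] for j in neighbors)
        scores.append(same_label / k)
    return scores


# Toy 1-D "embeddings" and class labels (hypothetical data).
positions = [0, 1, 2, 3, 10, 11]
labels = ["A", "A", "A", "B", "B", "B"]
distances = [[abs(a - b) for b in positions] for a in positions]

scores = knn_alignment_scores(distances, labels, k=2)
print(scores)
```

Samples deep inside a cluster of their own class score 1.0, while the class "B" point at position 3, which sits next to the "A" cluster, scores 0.0. A box plot of such scores per embedding space summarises how well each space groups samples of the same class.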

In [10]:
# KNN ALIGNMENT SCORES
from emmaemb.vizualization import plot_knn_alignment_across_embedding_spaces

fig_alignment_scores = plot_knn_alignment_across_embedding_spaces(
    emma, feature="enzyme_class", k=10, metric="cityblock"
)
fig_alignment_scores.update_layout(height=600, width=500)
fig_alignment_scores.show()

The KNN feature alignment scores can also be aggregated. The `plot_knn_alignment_across_classes` function shows a heatmap of the mean KNN feature alignment score, stratified by embedding space and feature class.
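The aggregation behind such a heatmap amounts to a group-wise mean. The sketch below shows the idea with made-up per-sample scores for two hypothetical embedding spaces; the names `mean_score_per_class`, `space_a`, and `space_b` are illustrative and not part of EmmaEmb.

```python
from collections import defaultdict


def mean_score_per_class(scores, labels):
    """Average per-sample scores within each class label."""
    grouped = defaultdict(list)
    for score, label in zip(scores, labels):
        grouped[label].append(score)
    return {label: sum(vals) / len(vals) for label, vals in sorted(grouped.items())}


# Made-up per-sample alignment scores for two hypothetical embedding spaces.
labels = ["A", "A", "B", "B"]
scores_by_space = {
    "space_a": [1.0, 0.5, 0.0, 1.0],
    "space_b": [0.5, 0.5, 1.0, 1.0],
}

# One row per embedding space, one column per class: the values a heatmap would show.
heatmap_values = {
    space: mean_score_per_class(scores, labels)
    for space, scores in scores_by_space.items()
}
print(heatmap_values)
```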

In [11]:
from emmaemb.vizualization import plot_knn_alignment_across_classes

fig_alignment_scores_class = plot_knn_alignment_across_classes(
    emma, feature="enzyme_class", k=100, metric="cityblock"
)
fig_alignment_scores_class.update_layout(height=600, width=500)
fig_alignment_scores_class.show()

### 3.2 KNN class mixing matrix

The KNN class mixing matrix quantifies the mixing of classes within the KNN neighborhood of samples in a given embedding space.
Given a distance metric (default: euclidean), the k nearest neighbors are retrieved for each sample, and the `get_class_mixing_in_neighborhood` function counts how often each class appears among them. The function returns a class mixing matrix, where each entry represents the number of times a class appears in the neighborhood of another class, along with the unique class labels. The matrix can be visualised as a heatmap using the `plot_knn_class_mixing_matrix` function.
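The counting step can be sketched as follows. This is a simplified illustration on toy data, not the library's `get_class_mixing_in_neighborhood`: the function name `class_mixing_matrix`, the row/column orientation, and the tie-breaking by sample index are assumptions made for this example.

```python
def class_mixing_matrix(distances, labels, k):
    """Count, per class, how often each class occurs among the k nearest neighbors.

    Rows index the class of the query sample, columns the class of the neighbor.
    """
    classes = sorted(set(labels))
    index = {c: n for n, c in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for i, row in enumerate(distances):
        others = [j for j in range(len(row)) if j != i]
        neighbors = sorted(others, key=lambda j: row[j])[:k]
        for j in neighbors:
            matrix[index[labels[i]]][index[labels[j]]] += 1
    return matrix, classes


# Toy 1-D "embeddings" and class labels (hypothetical data).
positions = [0, 1, 2, 3, 10, 11]
labels = ["A", "A", "A", "B", "B", "B"]
distances = [[abs(a - b) for b in positions] for a in positions]

matrix, classes = class_mixing_matrix(distances, labels, k=2)
print(classes)
print(matrix)
```

Each row sums to the number of samples in that class times k. Large off-diagonal entries indicate classes whose samples sit close to each other in the embedding space.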

In [12]:
# KNN CLASS MIXING MATRIX
from emmaemb.vizualization import plot_knn_class_mixing_matrix

fig_class_mixing_matrix = plot_knn_class_mixing_matrix(
    emma,
    emb_space="ProtT5",
    feature="enzyme_class",
    k=100,
    metric="cityblock",
)
fig_class_mixing_matrix.update_layout(height=600, width=600)
fig_class_mixing_matrix.show()

<a name="section-four"></a>
## 4. Pairwise space comparison

### 4.1 Global comparison of pairwise distances

The `plot_pairwise_distance_comparison` function generates a scatter plot to compare pairwise distances between samples in two different embedding spaces. Using a specified distance metric (default: euclidean), it shows the distances for the same set of samples across both embedding spaces.
Additionally, it computes the Spearman correlation coefficient between the pairwise distances in the two selected embedding spaces.
The function allows customization of plot title, color, and scatter dot opacity, and optionally groups points based on a metadata feature, enabling insights into how different sample categories behave across embeddings.
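The correlation step can be illustrated with a small stdlib-only sketch: collect the distance of every sample pair in both spaces, rank them, and compute the Pearson correlation of the ranks. The helper names and toy vectors are hypothetical, and the sketch does not reproduce how EmmaEmb computes the coefficient internally.

```python
from itertools import combinations
from math import sqrt


def cityblock(u, v):
    """Manhattan (L1) distance between two equal-length vectors."""
    return sum(abs(a - b) for a, b in zip(u, v))


def pair_distances(vectors):
    """Distances for every unordered sample pair, in a fixed pair order."""
    return [cityblock(u, v) for u, v in combinations(vectors, 2)]


def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2 + 1
        for idx in order[pos:end + 1]:
            ranks[idx] = mean_rank
        pos = end + 1
    return ranks


def pearson(x, y):
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank-transformed values."""
    return pearson(average_ranks(x), average_ranks(y))


# Two toy spaces with different dimensions; space_b is a stretched copy of space_a.
space_a = [[0.0, 0.0], [1.0, 0.0], [3.0, 0.0], [7.0, 0.0]]
space_b = [[0.0], [2.0], [6.0], [14.0]]

rho = spearman(pair_distances(space_a), pair_distances(space_b))
print(round(rho, 6))
```

Because Spearman correlation depends only on the ordering of the distances, two spaces that differ in scale or dimensionality can still be compared. In practice `scipy.stats.spearmanr` provides the same statistic.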

In [13]:
from emmaemb.vizualization import plot_pairwise_distance_comparison

fig_pwd_comparison = plot_pairwise_distance_comparison(
    emma,
    emb_space_y="ProtT5",
    emb_space_x="ESMC",
    metric="cityblock",
    group_by="species",
)
fig_pwd_comparison.update_layout(height=600, width=600)
fig_pwd_comparison.show()

### 4.2 Cross-space neighborhood similarity

The `plot_low_similarity_distribution` function visualizes the class distribution of samples with low neighborhood similarity between two embedding spaces. It computes neighborhood similarity scores based on a specified distance metric (default: euclidean) and identifies samples where similarity of the nearest neighbors of a data point falls below a given threshold. The function then compares the class distribution of these low-similarity samples to the overall dataset distribution using a scatter plot. This helps assess whether certain classes exhibit higher or lower structural consistency across embeddings, providing insights into differences in how embeddings capture relationships between samples.
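A minimal sketch of this workflow on toy data is shown below. It assumes that neighborhood similarity is the fraction of k nearest neighbors a sample shares between the two spaces; the text above does not spell out the exact definition, so treat this as one plausible reading rather than EmmaEmb's formula. All function names and data are hypothetical.

```python
from collections import Counter


def knn_sets(distances, k):
    """Set of the k nearest neighbor indices for every sample."""
    result = []
    for i, row in enumerate(distances):
        others = [j for j in range(len(row)) if j != i]
        result.append(set(sorted(others, key=lambda j: row[j])[:k]))
    return result


def neighborhood_similarity(dist_1, dist_2, k):
    """Assumed definition: share of k nearest neighbors common to both spaces."""
    pairs = zip(knn_sets(dist_1, k), knn_sets(dist_2, k))
    return [len(a & b) / k for a, b in pairs]


def class_fractions(labels):
    """Relative frequency of each class in a list of labels."""
    counts = Counter(labels)
    return {c: n / len(labels) for c, n in counts.items()} if labels else {}


# Toy 1-D positions of the same six samples in two hypothetical spaces.
pos_1 = [0, 1, 2, 10, 11, 12]
pos_2 = [0, 1, 2, 10, 11, 3]   # the last sample moves to the other cluster
labels = ["A", "A", "A", "B", "B", "B"]
dist_1 = [[abs(a - b) for b in pos_1] for a in pos_1]
dist_2 = [[abs(a - b) for b in pos_2] for a in pos_2]

similarity = neighborhood_similarity(dist_1, dist_2, k=2)
low_labels = [lab for s, lab in zip(similarity, labels) if s < 0.3]

low_frac = class_fractions(low_labels)
all_frac = class_fractions(labels)
# Per class: (fraction among low-similarity samples, fraction in the whole dataset).
comparison = {c: (low_frac.get(c, 0.0), all_frac[c]) for c in sorted(all_frac)}
print(similarity)
print(comparison)
```

A class whose fraction among the low-similarity samples clearly exceeds its overall fraction is one whose neighborhoods differ most between the two embedding spaces.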

In [15]:
from emmaemb.vizualization import plot_low_similarity_distribution

fig_low_similarity_class_distribution = plot_low_similarity_distribution(
    emma,
    emb_space_1="ProtT5",
    emb_space_2="ESMC",
    feature="enzyme_class",
    k=10,
    metric="cityblock",
    similarity_threshold=0.3,
)
fig_low_similarity_class_distribution.update_layout(height=600, width=600)
fig_low_similarity_class_distribution.show()