# Entity Similarity Demo

Entity similarity is calculated based on topological embeddings of terms in the DKG
produced by the second-order LINE algorithm described in
[LINE: Large-scale Information Network Embedding](https://arxiv.org/pdf/1503.03578).
This means that the relationships (i.e., edges) between nodes are used to make nodes
that are connected to similar nodes more similar in dense vector space.
Note: the current embedding approach does **not** take into account entities' lexical features (labels, descriptions, and synonyms).

The cosine similarity between two embedding vectors is defined as the dot product
of the vectors divided by the product of their L2 norms (i.e., magnitudes).
It ranges over [-1,1], where -1 represents two entities that are
very dissimilar, 0 represents entities that are not similar, and 1 represents
entities that are similar. This is calculated using `scipy.spatial.distance.cosine`,
which returns the cosine *distance*, i.e., one minus the cosine similarity.

We normalize this onto a range of [0,1] such that 0 means very dissimilar, 0.5
means not similar, and 1 means similar. This is accomplished with the transform:

> `normalized_cosine = (2 - scipy.spatial.distance.cosine(X, Y)) / 2`
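
As a quick sanity check of the transform, here is a minimal dependency-free sketch. The helper names `cosine_similarity` and `normalized_cosine` are illustrative and not part of the service; the sketch relies on the fact that `scipy.spatial.distance.cosine` returns one minus the cosine similarity, so `(2 - distance) / 2` is the same as `(1 + similarity) / 2`.

```python
import math


def cosine_similarity(x, y):
    """Dot product divided by the product of the L2 norms."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)


def normalized_cosine(x, y):
    """Map cosine similarity from [-1, 1] onto [0, 1]."""
    # Equivalent to (2 - scipy.spatial.distance.cosine(x, y)) / 2
    return (1 + cosine_similarity(x, y)) / 2


print(normalized_cosine([1.0, 0.0], [1.0, 0.0]))   # same direction: 1.0
print(normalized_cosine([1.0, 0.0], [0.0, 1.0]))   # orthogonal: 0.5
print(normalized_cosine([1.0, 0.0], [-1.0, 0.0]))  # opposite direction: 0.0
```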

In [1]:
import requests
import pandas as pd

Documentation for the entity similarity endpoint can be found at http://34.230.33.149:8771/docs#/entities/entity_similarity_api_entity_similarity_post. It takes in compact URIs (CURIEs), which are the "primary keys" for terms in the DKG. It then performs an all-by-all comparison of sources and targets.
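
To make "all-by-all" concrete, the sketch below enumerates the pairs such a comparison would cover. The helper `all_by_all` is hypothetical and only mirrors the behavior described above; the endpoint's actual pairing order is an assumption.

```python
from itertools import product


def all_by_all(sources, targets=None):
    """List every (source, target) pair, including self-comparisons."""
    # Assumption: omitting targets compares the sources against themselves
    if targets is None:
        targets = sources
    return list(product(sources, targets))


pairs = all_by_all(["ido:0000514", "ido:0000511"])
print(len(pairs))  # 4 pairs: 2 sources x 2 targets
for source, target in pairs:
    print(source, target)
```

With `n` sources and `m` targets this yields `n * m` rows, which is why six CURIEs compared against themselves produce 36 rows.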

In [10]:
URL = "http://127.0.0.1:8771/api/entity_similarity"

def get_similarities_df(sources, targets=None):
    """Return all-by-all similarities annotated with entity names."""
    # Compare the sources against themselves if no targets are given
    if targets is None:
        targets = sources
    res = requests.post(URL, json={"sources": sources, "targets": targets})
    res.raise_for_status()
    df = pd.DataFrame(res.json())
    assert "similarity" in df.columns

    # Look up the name of every CURIE that appears in the results
    curies = ",".join(sorted(set(df["source"]).union(df["target"])))
    res = requests.get(f"http://127.0.0.1:8771/api/entities/{curies}")
    res.raise_for_status()
    names = {record["id"]: record["name"] for record in res.json()}

    df["source_name"] = df["source"].map(names)
    df["target_name"] = df["target"].map(names)
    return df[["source", "source_name", "target", "target_name", "similarity"]]

Tom's example ([in this thread](https://askemgroup.slack.com/archives/C03THCGK2DU/p1704310487727779)) has us comparing `ido:0000514` (susceptible population) and `ido:0000511` (infected population). These nodes are related, so their cross-comparison has a value over 0.5. A self-comparison always comes out to 1.0.

In [11]:
get_similarities_df(
    [
        "ido:0000514",  # susceptible population
        "ido:0000592",  # immune population
        "vo:0004921",  # human age
        "ido:0000511",  # infected population
        "ido:0000512",  # diseased population
        "apollosv:00000233",  # infected population
    ]
)

| | source | source_name | target | target_name | similarity |
|---|---|---|---|---|---|
| 0 | ido:0000514 | susceptible population | ido:0000514 | susceptible population | 1.0 |
| 1 | ido:0000514 | susceptible population | ido:0000592 | immune population | 0.537801 |
| 2 | ido:0000514 | susceptible population | vo:0004921 | human age | 0.56511 |
| 3 | ido:0000514 | susceptible population | ido:0000511 | infected population | 0.555182 |
| 4 | ido:0000514 | susceptible population | ido:0000512 | diseased population | 0.639745 |
| 5 | ido:0000514 | susceptible population | apollosv:00000233 | infected population | 0.516487 |
| 6 | ido:0000592 | immune population | ido:0000514 | susceptible population | 0.537801 |
| 7 | ido:0000592 | immune population | ido:0000592 | immune population | 1.0 |
| 8 | ido:0000592 | immune population | vo:0004921 | human age | 0.374794 |
| 9 | ido:0000592 | immune population | ido:0000511 | infected population | 0.48462 |


Unfortunately, `apollosv:00000233` (infected population) and `ido:0000511` (infected population), two terms from different ontologies that describe the same concept, do not have a high similarity. This is probably because edge annotations are much more prevalent on IDO terms than on APOLLO_SV terms, so the topological embedding could not capture their equivalence.

Here are a few ideas on how to remedy this:

1. Include the equivalence edges in the DKG embedding step (they are currently just properties of nodes)
2. Use SeMRA to automatically collapse nodes together either during the whole DKG build or during the embedding step
3. Include lexical information in the entity similarity in addition to topological similarity
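
As a rough illustration of the third idea, the sketch below blends a topological score with a simple label-based score. Everything here is an assumption for illustration: the function names, the equal weighting, and the use of `difflib.SequenceMatcher` as the lexical measure are not part of the current service, and the topological input of 0.5 is a placeholder rather than a measured value.

```python
from difflib import SequenceMatcher


def lexical_similarity(label_a, label_b):
    """String similarity in [0, 1] between two case-folded labels."""
    return SequenceMatcher(None, label_a.lower(), label_b.lower()).ratio()


def blended_similarity(topological, label_a, label_b, weight=0.5):
    """Weighted average of a lexical score and a topological score."""
    # weight is a hypothetical tuning parameter: 0 ignores labels, 1 ignores topology
    lexical = lexical_similarity(label_a, label_b)
    return weight * lexical + (1 - weight) * topological


# Placeholder topological score of 0.5 for two terms sharing the same label
score = blended_similarity(0.5, "infected population", "infected population")
print(score)  # 0.75
```

Two terms with identical labels get a lexical score of 1.0, so the blend pulls a middling topological score upward. A real implementation would likely also draw on synonyms and descriptions.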