**KB Concept Embedding**

- The paper uses a sentence embedding model (Subramanian et al. 2018) to generate a representation of each ontology definition.
- The ontology definitions were retrieved from the QuickGO API in the previous step and saved to a local SQLite database.
- The library used by the authors is available at [GenSen - GitHub](https://github.com/Maluuba/gensen).
- However, running the script that downloads the pre-trained models and data required by this sentence embedder fails with `ERROR 409: Public access is not permitted on this storage account.`

- Readily available alternatives that can produce sentence embeddings include [Sentence-Transformers](https://www.sbert.net/), [Universal Sentence Encoder](https://www.tensorflow.org/hub/tutorials/semantic_similarity_with_tf_hub_universal_encoder), and [SpaCy](https://spacy.io/).

- Because of its simple API, I will use Sentence-Transformers to embed the ontology definitions.
- The dimensionality of a Sentence-Transformers embedding is fixed by the chosen model and is usually well above 100 (`all-MiniLM-L6-v2`, used below, outputs 384-dimensional vectors), but the KB concept embedding needs a 100-dimensional vector.
- I will use Principal Component Analysis (PCA) to reduce the embeddings to 100 dimensions.
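
To make the reduction step concrete, here is a minimal NumPy-only sketch of what PCA does to a batch of embeddings. It is an illustration under stated assumptions, not the notebook's pipeline: the input vectors are random stand-ins for real sentence embeddings, the sample count of 500 is arbitrary, and the PCA is written out with an SVD instead of calling scikit-learn.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for sentence embeddings: 500 random vectors of dimension 384
# (384 matches all-MiniLM-L6-v2; the values themselves are made up).
embeddings = rng.normal(size=(500, 384))
target_dim = 100

# PCA cannot return more components than min(n_samples, n_features).
assert target_dim <= min(embeddings.shape)

# Fit: centre the data and keep the top right-singular vectors as components.
mean = embeddings.mean(axis=0)
_, singular_values, vt = np.linalg.svd(embeddings - mean, full_matrices=False)
components = vt[:target_dim]  # shape (100, 384)

# Transform: project the centred vectors onto the components.
reduced = (embeddings - mean) @ components.T  # shape (500, 100)

# Fraction of the total variance kept by the first 100 components.
explained = (singular_values[:target_dim] ** 2).sum() / (singular_values ** 2).sum()

# New vectors reuse the fitted mean and components; nothing is refitted.
new_vectors = rng.normal(size=(3, 384))
new_reduced = (new_vectors - mean) @ components.T

print(reduced.shape, new_reduced.shape, round(float(explained), 3))
```

Two points carry over to the real data: at least 100 definitions are needed to obtain 100 components, and the projection is specific to the corpus it was fitted on, so vectors reduced with different fits are not comparable.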


In [None]:
# !pip install sentence-transformers
# !pip install scikit-learn
# !pip install faker

In [None]:
from sentence_transformers import SentenceTransformer
from sklearn.decomposition import PCA
import numpy as np
import sqlite3

In [None]:
# Example usage of Sentence-Transformers followed by PCA
import random
from faker import Faker

# Generate 100 fake sentences
fake = Faker()
fake_sentences = [fake.sentence(nb_words=random.randint(5, 15)) for _ in range(100)]

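# Initialize models. PCA can return at most min(n_samples, n_features) components,
# so 100 sentences is the minimum that allows n_components=100.
model = SentenceTransformer("all-MiniLM-L6-v2")
pca = PCA(n_components=100)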

# create embeddings
embeddings = model.encode(fake_sentences)
reduced_embeddings = pca.fit_transform(embeddings)

print("Original Embeddings Shape: ", np.array(embeddings).shape)
print("Reduced Embeddings Shape: ", reduced_embeddings.shape)
print(reduced_embeddings[0])

In [None]:
class OntologyDefinitionEmbedder:
    """Map each entity to a PCA-reduced sentence embedding of its GO definition."""

    def __init__(self, sqlite_path, pca_components=100, sentence_transformer_type="all-MiniLM-L6-v2"):
        self.entity_GOdefinition_embedding = {}
        self.sqlite_conn = sqlite3.connect(sqlite_path)
        self.pca_components = pca_components
        self.sentence_transformer_type = sentence_transformer_type
        self.read_SQLite_data()
        
    def read_SQLite_data(self):
        
        print("Reading GO definitions from SQLite")
        cursor = self.sqlite_conn.cursor()
        query = """
        SELECT entity, GO_definition
        FROM QuickGO
        WHERE GO_definition != '' AND GO_definition IS NOT NULL;
        """
        cursor.execute(query)
        rows = cursor.fetchall()
        print(f"Done Reading {len(rows)} GO definitions from SQLite")
        self.create_embeddings(rows)
        
    def create_embeddings(self, rows):
        """Encode each GO definition and reduce it to `pca_components` dimensions."""
        print("Creating embeddings for GO_definitions")
        # PCA cannot produce more components than there are samples.
        if len(rows) < self.pca_components:
            raise ValueError(
                f"PCA needs at least {self.pca_components} definitions, got {len(rows)}"
            )
        model = SentenceTransformer(self.sentence_transformer_type)
        pca = PCA(n_components=self.pca_components)

        entities = [row[0] for row in rows]
        GO_defs = [row[1] for row in rows]

        definition_embeddings = model.encode(GO_defs)
        reduced_embeddings = pca.fit_transform(definition_embeddings)

        self.entity_GOdefinition_embedding = dict(zip(entities, reduced_embeddings))
        print(f"Done Creating embeddings for {len(entities)} GO_definitions corresponding to entities")
        
    def get_reduced_embedding_for_entity(self, entity):
        """Return the reduced embedding, or a zero vector if the entity has no GO definition."""
        return self.entity_GOdefinition_embedding.get(entity, np.zeros(self.pca_components))

In [None]:
sqlite_path = "../QuickGO.db"
ontology_def_embedder = OntologyDefinitionEmbedder(sqlite_path)
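
The query used by the embedder keeps only rows with a non-empty, non-NULL definition. The sketch below reproduces that filter against an in-memory database so its behaviour can be checked without `QuickGO.db`. The table and column names come from the query above; the three rows and the `TEXT` column types are hypothetical.

```python
import sqlite3

# In-memory database with a hypothetical QuickGO table; only the table and
# column names are taken from the notebook's query.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE QuickGO (entity TEXT, GO_definition TEXT)")
conn.executemany(
    "INSERT INTO QuickGO VALUES (?, ?)",
    [
        ("entity_a", "A made-up definition used only for illustration."),
        ("entity_b", ""),    # empty definition: filtered out
        ("entity_c", None),  # NULL definition: filtered out
    ],
)

rows = conn.execute(
    """
    SELECT entity, GO_definition
    FROM QuickGO
    WHERE GO_definition != '' AND GO_definition IS NOT NULL;
    """
).fetchall()
conn.close()

entities = [entity for entity, _ in rows]
print(entities)  # ['entity_a']
```

Entities dropped by this filter never reach the encoder, so they fall back to the zero vector at lookup time.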

In [None]:
example_entity_embedding = ontology_def_embedder.get_reduced_embedding_for_entity("striatum")
example_entity_embedding_DNE = ontology_def_embedder.get_reduced_embedding_for_entity("IDONOTEXIST")
print(example_entity_embedding)
print(example_entity_embedding_DNE)
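
Since unknown entities come back as an all-zero vector, downstream code may want to tell "no definition available" apart from a real embedding. A minimal sketch of that check follows; the `table`, `lookup`, and `is_missing` names are hypothetical and only mirror the fallback used by `get_reduced_embedding_for_entity`.

```python
import numpy as np

dim = 100
# Hypothetical lookup table holding a single known entity.
table = {"known_entity": np.ones(dim)}

def lookup(entity):
    """Return the stored vector, or a zero vector for unknown entities."""
    return table.get(entity, np.zeros(dim))

def is_missing(vector):
    """Treat an all-zero vector as 'no definition embedding available'."""
    return not np.any(vector)

found = lookup("known_entity")
missing = lookup("IDONOTEXIST")
print(is_missing(found), is_missing(missing))  # False True
```

This check assumes a genuine reduced embedding is never exactly zero in every component, which holds in practice but is not guaranteed.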
