**KB Concept Embedding**

- I have saved the QuickGO ontology definition for each entity in both datasets, CONLL and STANDOFF.
- The CONLL and STANDOFF definitions are stored in the `QuickGOCONLL` and `QuickGOSTANDOFF` tables, respectively.
- The paper uses a sentence embedding model (Subramanian et al. 2018) to generate a representation of each ontology definition.
- The ontology definitions were retrieved from the QuickGO API in the previous step and saved to a local SQLite database.
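
As a rough illustration of the storage described above, the sketch below builds an in-memory table and runs the same filter that is used later in this notebook. The column names `entity` and `GO_definition` come from that query; the `TEXT` column types and the placeholder rows are assumptions made only for this example, not the actual contents of `QuickGO.db`.

```python
import sqlite3

# In-memory stand-in for the local SQLite file (schema is an assumption)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE QuickGOSTANDOFF (entity TEXT, GO_definition TEXT)")
conn.executemany(
    "INSERT INTO QuickGOSTANDOFF VALUES (?, ?)",
    [
        ("entity_a", "Placeholder definition text for entity_a."),
        ("entity_b", ""),    # empty definition, filtered out
        ("entity_c", None),  # missing definition, filtered out
    ],
)

# Keep only the entities that actually have a definition to embed
rows = conn.execute(
    "SELECT entity, GO_definition FROM QuickGOSTANDOFF "
    "WHERE GO_definition != '' AND GO_definition IS NOT NULL"
).fetchall()
conn.close()

print(rows)
```

Only `entity_a` survives the filter, so entities with empty or missing definitions never reach the embedding step.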


- The library used by the authors is available at [GenSen - GitHub](https://github.com/Maluuba/gensen).
- However, running the script that downloads the pre-trained models and data required by this sentence embedder fails with `ERROR 409: Public access is not permitted on this storage account.`

- Readily available alternatives that can produce sentence embeddings include [Sentence-Transformers](https://www.sbert.net/), [Universal Sentence Encoder](https://www.tensorflow.org/hub/tutorials/semantic_similarity_with_tf_hub_universal_encoder), and [spaCy](https://spacy.io/).

- Because of its simple API, I will use Sentence-Transformers to produce an embedding for each ontology definition.
- The embedding dimension in Sentence-Transformers is fixed by the chosen model (384 for `all-MiniLM-L6-v2`, which is used below), but we need a 100-dimensional vector for the KB concept embedding.
- I will use Principal Component Analysis (PCA) to reduce the embeddings to 100 dimensions.
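
To make the reduction step concrete, here is a minimal NumPy-only sketch of what PCA does to the embedding matrix. It is not the scikit-learn implementation used in this notebook; the helper name `pcaReduce` and the random stand-in data are hypothetical. It also shows a constraint that scikit-learn's `PCA` enforces as well: the number of components cannot exceed `min(n_samples, n_features)`, so at least 100 definitions are needed to obtain 100 components.

```python
import numpy as np

def pcaReduce(X, nComponents):
    """Project the rows of X onto the top principal axes (plain SVD-based PCA)."""
    nSamples, nFeatures = X.shape
    limit = min(nSamples, nFeatures)
    if nComponents > limit:
        raise ValueError(f"nComponents={nComponents} exceeds min(n_samples, n_features)={limit}")
    # Center each feature so the SVD captures variance around the mean
    centered = X - X.mean(axis=0)
    # Rows of vt are the principal axes, ordered by decreasing singular value
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:nComponents].T

# Random stand-in for 150 sentence embeddings of dimension 384
rng = np.random.default_rng(0)
fakeEmbeddings = rng.normal(size=(150, 384))

reduced = pcaReduce(fakeEmbeddings, 100)
print(reduced.shape)
```

With 150 rows the result has shape `(150, 100)`; calling the same helper with fewer than 100 rows raises a `ValueError`.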


In [None]:
# !pip install sentence-transformers
# !pip install scikit-learn
# !pip install faker

In [None]:
from sentence_transformers import SentenceTransformer
from sklearn.decomposition import PCA
import numpy as np
import sqlite3

In [None]:
# Example usage of Sentence-Transformers followed by PCA
import random
from faker import Faker

# Generate 100 fake sentences
fake = Faker()
fakeSentences = [fake.sentence(nb_words=random.randint(5, 15)) for _ in range(100)]

# Initialize models
model = SentenceTransformer("all-MiniLM-L6-v2")
# PCA needs n_components <= min(n_samples, n_features), so 100 sentences is the minimum here
pca = PCA(n_components=100)

# create embeddings
embeddings = model.encode(fakeSentences)
reducedEmbeddings = pca.fit_transform(embeddings)

print("Original Embeddings Shape: ", np.array(embeddings).shape)
print("Reduced Embeddings Shape: ", reducedEmbeddings.shape)
print(reducedEmbeddings[0])

In [None]:
class OntologyDefinitionEmbedder:
    """Embeds the GO definitions stored in a SQLite table and reduces them with PCA."""

    def __init__(self, sqlitePath, table, pcaComponents=100, sentenceTransformerType="all-MiniLM-L6-v2"):
        self.entityGOdefinitionEmbedding = {}
        self.sqliteConn = sqlite3.connect(sqlitePath)
        self.table = table
        self.pcaComponents = pcaComponents
        self.sentenceTransformerType = sentenceTransformerType
        self.readSQLiteData()
        
    def readSQLiteData(self):
        
        print("Reading GO definitions from SQLite")
        cursor = self.sqliteConn.cursor()
        
        # Table names cannot be bound as SQL parameters, so only pass trusted table names here
        query = """
        SELECT entity, GO_definition
        FROM {table}
        WHERE GO_definition != '' AND GO_definition IS NOT NULL;
        """.format(table=self.table)
        
        cursor.execute(query)
        rows = cursor.fetchall()
        self.sqliteConn.close()
        
        print(f"Done Reading {len(rows)} GO definitions from SQLite")
        self.createEmbeddings(rows)
        
    def createEmbeddings(self, rows):
        print("Creating embeddings for GO_definitions")
        model = SentenceTransformer(self.sentenceTransformerType)
        pca = PCA(n_components=self.pcaComponents)
        
        entities = []
        GOdefs = []
        
        for row in rows:
            entities.append(row[0])
            GOdefs.append(row[1])
        
        definitionEmbeddings = model.encode(GOdefs)
        # PCA raises a ValueError if there are fewer definitions than pcaComponents
        reducedEmbeddings = pca.fit_transform(definitionEmbeddings)
        
        self.entityGOdefinitionEmbedding = dict(zip(entities, reducedEmbeddings))
        print(f"Done Creating embeddings for {len(entities)} GO_definitions corresponding to entities")
        
    def getEmbeddingForEntity(self, entity):
        # Entities without a stored definition fall back to an all-zero vector
        return self.entityGOdefinitionEmbedding.get(entity, np.zeros(self.pcaComponents))

In [None]:
sqlitePath = "../QuickGO.db"
# table = "QuickGOCONLL"
table = "QuickGOSTANDOFF"
ontologyDefinitionEmbedder = OntologyDefinitionEmbedder(sqlitePath, table)

In [None]:
exampleEntityEmbedding = ontologyDefinitionEmbedder.getEmbeddingForEntity("angiogenesis")
exampleEntityEmbeddingDNE = ontologyDefinitionEmbedder.getEmbeddingForEntity("IDONOTEXIST")
print(exampleEntityEmbedding)
print(exampleEntityEmbeddingDNE)
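
Because unknown entities come back as an all-zero vector rather than `None`, downstream code may want an explicit check before using an embedding. The sketch below is one possible way to do that; the helper name `hasEmbedding` and the short example vectors are hypothetical stand-ins for the two lookups above.

```python
import numpy as np

def hasEmbedding(vector):
    """Return True unless the vector is the all-zero fallback."""
    return bool(np.any(vector))

# Hypothetical stand-ins for a found and a missing entity
knownVector = np.array([0.12, -0.40, 0.05])
missingVector = np.zeros(3)

print(hasEmbedding(knownVector))
print(hasEmbedding(missingVector))
```

This relies on the assumption that a real PCA-reduced embedding is never exactly zero in every dimension, which is not guaranteed in principle but is very unlikely in practice.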
