**KB Concept Embedding**

- I have saved the QuickGO ontology definition for each entity in both datasets, CONLL and STANDOFF.
- The CONLL and STANDOFF definitions are stored in the `QuickGOCONLL` and `QuickGOSTANDOFF` tables, respectively.
- The paper uses a sentence embedding model (Subramanian et al. 2018) to generate a representation of each ontology definition.
- The ontology definitions were retrieved from the QuickGO API in the previous step and saved to a local SQLite database.
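
The storage layout described above can be sketched with an in-memory database. The column names `entity` and `GO_definition` come from the query used later in this notebook; the sample rows and the `TEXT` column types are assumptions made purely for illustration.

```python
import sqlite3

# In-memory stand-in for the local QuickGO database (illustrative only)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE QuickGOSTANDOFF (entity TEXT, GO_definition TEXT)")

# Hypothetical rows: one with a definition, one empty, one NULL
sample_rows = [
    ("entity_a", "A placeholder definition used only for this sketch."),
    ("entity_b", ""),
    ("entity_c", None),
]
conn.executemany("INSERT INTO QuickGOSTANDOFF VALUES (?, ?)", sample_rows)

# Keep only entities that actually have a definition to embed
rows = conn.execute(
    "SELECT entity, GO_definition FROM QuickGOSTANDOFF "
    "WHERE GO_definition != '' AND GO_definition IS NOT NULL"
).fetchall()
conn.close()

print(rows)
```

Only the first row survives the filter, so entities without a usable definition never reach the embedding step.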


- The library used by the authors is available at [GenSen - GitHub](https://github.com/Maluuba/gensen)
- However, running the script that downloads the pre-trained models and data required by this sentence embedder fails with `ERROR 409: Public access is not permitted on this storage account.`

- Readily available alternatives that can produce sentence embeddings include [Sentence-Transformers](https://www.sbert.net/), [Universal Sentence Encoder](https://www.tensorflow.org/hub/tutorials/semantic_similarity_with_tf_hub_universal_encoder), and [spaCy](https://spacy.io/).

- Because of its simple API, I will use Sentence-Transformers to embed the ontology definitions.
- The embedding dimension in Sentence-Transformers is fixed by the chosen model (usually well above 100; `all-MiniLM-L6-v2`, used below, outputs 384 dimensions), whereas the KB concept embedding needs a 100-dimensional vector.
- I will use Principal Component Analysis (PCA) to reduce the embeddings to 100 dimensions.
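
How much information survives the reduction to 100 dimensions can be checked from the explained variance. The sketch below is a plain NumPy version of PCA (centering followed by an SVD), run on random data standing in for real embeddings. The 384-dimensional input, the sample count of 500, and the helper name `pca_reduce` are assumptions for illustration. With scikit-learn, the same per-component fractions are available as `explained_variance_ratio_` on a fitted `PCA` object.

```python
import numpy as np

def pca_reduce(X, n_components):
    """Project X onto its top principal components (hypothetical helper)."""
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by singular value
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    reduced = X_centered @ Vt[:n_components].T
    # Fraction of the total variance carried by each kept component
    explained_ratio = (S[:n_components] ** 2) / np.sum(S ** 2)
    return reduced, explained_ratio

rng = np.random.default_rng(0)
fake_embeddings = rng.normal(size=(500, 384))  # stand-in for model.encode(...)

reduced, explained_ratio = pca_reduce(fake_embeddings, 100)
print("Reduced shape:", reduced.shape)
print("Variance retained:", explained_ratio.sum())
```

Random Gaussian data has no low-dimensional structure, so the fraction printed here says nothing about real definition embeddings; the same number is worth printing for the actual data before relying on the 100-dimensional vectors.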


In [1]:
# !pip install sentence-transformers
# !pip install scikit-learn
# !pip install faker

In [2]:
from sentence_transformers import SentenceTransformer
from sklearn.decomposition import PCA
import numpy as np
import sqlite3

In [3]:
# Example usage of Sentence-Transformers followed by PCA
import random
from faker import Faker

# Generate 100 fake sentences
fake = Faker()
fakeSentences = [fake.sentence(nb_words=random.randint(5, 15)) for _ in range(100)]

# initialize models
model = SentenceTransformer("all-MiniLM-L6-v2")
# n_components cannot exceed min(n_samples, n_features), hence at least 100 sentences
pca = PCA(n_components=100)

# create embeddings
embeddings = model.encode(fakeSentences)
reducedEmbeddings = pca.fit_transform(embeddings)

print("Original Embeddings Shape: ", np.array(embeddings).shape)
print("Reduced Embeddings Shape: ", reducedEmbeddings.shape)
print(reducedEmbeddings[0])

Original Embeddings Shape:  (100, 384)
Reduced Embeddings Shape:  (100, 100)
[ 2.15076152e-02  2.09832311e-01 -1.08795092e-01 -1.90676868e-01
  1.64481774e-01  1.37946710e-01  7.47257471e-02 -1.49772435e-01
  2.09792107e-01  1.00864666e-02 -5.80841303e-02 -2.11697295e-01
 -6.49506226e-02 -1.23653501e-01 -8.27922374e-02 -2.08600238e-01
  1.31045068e-02 -4.37145792e-02 -8.73954780e-03  2.39319086e-01
  8.03546831e-02 -1.34598240e-01  2.61495728e-02 -5.56042157e-02
 -1.63719013e-01 -3.94229703e-02 -2.11805161e-02  1.76812977e-01
 -1.03446677e-01 -7.32576028e-02  4.68325103e-03 -9.15732048e-03
  2.28551820e-01 -2.76175886e-03  1.20742865e-01  7.18644559e-02
 -2.06161719e-02 -2.19293479e-02  3.90015766e-02  7.08631948e-02
  2.27317605e-02 -1.97708547e-01  2.21949667e-01 -4.18668799e-02
  1.21047862e-01 -9.22309682e-02  2.11949460e-02 -5.56671508e-02
 -3.18974331e-02  1.04516611e-01  2.52879914e-02  3.28234918e-02
 -6.75778277e-03  1.02122575e-01 -4.79003489e-02  1.40124355e-02
 -1.12485820e

In [4]:
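class OntologyDefinitionEmbedder:
    """Embed the GO definitions stored in a SQLite table and reduce them with PCA."""

    def __init__(self, sqlitePath, table, pcaComponents=100, sentenceTransformerType="all-MiniLM-L6-v2"):
        # Maps entity -> reduced embedding (NumPy array of length pcaComponents)
        self.entityGOdefinitionEmbedding = {}
        self.sqliteConn = sqlite3.connect(sqlitePath)
        self.table = table
        self.pcaComponents = pcaComponents
        self.sentenceTransformerType = sentenceTransformerType
        # Load the definitions and build the embeddings immediately
        self.readSQLiteData()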
        
    def readSQLiteData(self):
        """Read all non-empty GO definitions, then build their embeddings."""
        print("Reading GO definitions from SQLite")
        cursor = self.sqliteConn.cursor()

        # Table names cannot be bound as SQL parameters, so the name is
        # interpolated; it must come from trusted code, not user input.
        query = """
        SELECT entity, GO_definition
        FROM {table}
        WHERE GO_definition != '' AND GO_definition IS NOT NULL;
        """.format(table=self.table)

        cursor.execute(query)
        rows = cursor.fetchall()
        self.sqliteConn.close()

        print(f"Done Reading {len(rows)} GO definitions from SQLite")
        self.createEmbeddings(rows)
        
    def createEmbeddings(self, rows):
        """Encode each definition and project it to pcaComponents dimensions."""
        print("Creating embeddings for GO_definitions")
        model = SentenceTransformer(self.sentenceTransformerType)
        # PCA requires pcaComponents <= min(number of rows, embedding dimension)
        pca = PCA(n_components=self.pcaComponents)

        entities = [row[0] for row in rows]
        GOdefs = [row[1] for row in rows]

        definitionEmbeddings = model.encode(GOdefs)
        reducedEmbeddings = pca.fit_transform(definitionEmbeddings)

        # If an entity appears in several rows, the last row wins
        self.entityGOdefinitionEmbedding = dict(zip(entities, reducedEmbeddings))
        print(f"Done Creating embeddings for {len(entities)} GO_definitions corresponding to entities")
        
    def getEmbeddingForEntity(self, entity):
        """Return the entity's embedding, or a zero vector if it has no definition."""
        return self.entityGOdefinitionEmbedding.get(entity, np.zeros(self.pcaComponents))
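
One consequence of the zero-vector fallback is that downstream code can build a fixed-width feature matrix for any list of entities. A minimal sketch, using a toy lookup table in place of the fitted embedder; the entity names, the 4-dimensional vectors, and the helper `stack_embeddings` are hypothetical and not part of the notebook:

```python
import numpy as np

DIM = 4  # toy dimensionality; the notebook uses 100

# Toy stand-in for entityGOdefinitionEmbedding
lookup = {
    "entity_a": np.array([0.1, -0.2, 0.3, 0.0]),
    "entity_b": np.array([0.5, 0.5, -0.1, 0.2]),
}

def stack_embeddings(entities, lookup, dim):
    """Return an (n, dim) matrix and a boolean mask of entities that were found."""
    matrix = np.stack([lookup.get(e, np.zeros(dim)) for e in entities])
    found = np.array([e in lookup for e in entities])
    return matrix, found

matrix, found = stack_embeddings(["entity_a", "missing", "entity_b"], lookup, DIM)
print(matrix.shape)
print(found)
```

Keeping the mask alongside the matrix avoids relying on an all-zero row to signal a missing entity.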

In [5]:
sqlitePath = "../QuickGO.db"
# table = "QuickGOCONLL"
table = "QuickGOSTANDOFF"
ontologyDefinitionEmbedder = OntologyDefinitionEmbedder(sqlitePath, table)

Reading GO definitions from SQLite
Done Reading 2528 GO definitions from SQLite
Creating embeddings for GO_definitions
Done Creating embeddings for 2528 GO_definitions corresponding to entities


In [6]:
exampleEntityEmbedding = ontologyDefinitionEmbedder.getEmbeddingForEntity("angiogenesis")
exampleEntityEmbeddingDNE = ontologyDefinitionEmbedder.getEmbeddingForEntity("IDONOTEXIST")
print(exampleEntityEmbedding)
print(exampleEntityEmbeddingDNE)


[-0.21365184  0.43349725 -0.33151343 -0.20256285  0.07304411  0.1434427
  0.12300024 -0.17338209  0.11258581  0.02974366 -0.02566108  0.01236665
 -0.08322132  0.07523662 -0.02602699 -0.0477213   0.0394746  -0.00344671
 -0.07337301 -0.03498936 -0.1293745  -0.06551908 -0.05756119  0.02663484
 -0.00282967  0.02529964  0.06037331  0.09391612 -0.06256899  0.05370912
  0.09937633 -0.05414293  0.06022435  0.04373126  0.04766092 -0.04245023
 -0.07007768 -0.12678671 -0.03897177 -0.0147526  -0.01271358  0.01991127
  0.00611725  0.01438462 -0.02243987 -0.07228547 -0.03894063 -0.04607353
 -0.02574906  0.0053082  -0.08145685  0.01793898 -0.00831993  0.03044647
  0.00185351  0.07601354  0.02878074 -0.07895198  0.03439514 -0.01432682
 -0.04036571 -0.06453322  0.04099624  0.05552135  0.07116162  0.03468121
 -0.05444211  0.06675231  0.00464143  0.02575271  0.05574566  0.06569096
 -0.06603791 -0.08505224 -0.04144668 -0.01647161  0.12321257  0.03027159
 -0.04481766 -0.01044078 -0.10291919 -0.03137582  0.