# BioConceptVec Tutorial

This tutorial introduces the basics of our BioConceptVec models. It shows how to (1) load the model, (2) get concept vectors, and (3) search for the top K similar concepts.

## 1. Prerequisites

Install gensim to load BioConceptVec.

In [33]:
from gensim.models import KeyedVectors
import os, sys, json, numpy as np

## 2. Load BioConceptVec

Let's create a function using gensim to load BioConceptVec.

In [16]:
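def load_embedding(path, binary):
    # Pass `binary` by keyword: the second positional parameter of
    # load_word2vec_format is `fvocab`, not `binary`.
    embedding = KeyedVectors.load_word2vec_format(path, binary=binary)
    print('embedding loaded from', path)
    return embedding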

Now load one version of BioConceptVec. Loading might take a few minutes.

In [17]:
model = load_embedding(YOUR_BIOCONCEPTVEC_PATH, binary=True)

embedding loaded from /panfs/pan1.be-md.ncbi.nlm.nih.gov/bionlp/lulab/qingyu/entity_embedding/0308/pubmed_concepts_word2vec_skip.bin
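
Because loading can take minutes, it may help to confirm the path before starting and to try a partial load first. The sketch below is one possible wrapper, not part of BioConceptVec: the name `load_embedding_checked` is hypothetical, and it relies on the `limit` parameter of gensim's `load_word2vec_format`, which reads only the first `limit` vectors in the file.

```python
import os


def load_embedding_checked(path, binary=True, limit=None):
    """Load word2vec-format vectors, failing fast if the file is missing."""
    if not os.path.isfile(path):
        raise FileNotFoundError(f"no embedding file at {path!r}")
    # Imported here so the path check runs before gensim is loaded.
    from gensim.models import KeyedVectors
    # `limit` caps how many vectors are read, in file order; None reads all.
    return KeyedVectors.load_word2vec_format(path, binary=binary, limit=limit)
```

For a quick trial, something like `load_embedding_checked(YOUR_BIOCONCEPTVEC_PATH, limit=50000)` could be used. Whether a given concept ID is among the first vectors depends on the file's ordering, so a truncated load may not contain every concept.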


## 3. Load concept vectors only (an alternative approach)

If you only need concept vectors and not the vectors of other common words, you can instead load the JSON file, which contains concept vectors only.

In [29]:
with open(YOUR_JSON_PATH) as json_file:  
    concept_vectors = json.load(json_file)
print('load', len(concept_vectors), 'concepts')

load 402712 concepts
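
After loading, a quick sanity check of the dictionary can catch a wrong or truncated file. The following is a minimal sketch on toy data. The helper name `summarize_concepts` is hypothetical, and it assumes that the text before the first underscore in an ID (as in `Gene_2997`) names the concept type.

```python
from collections import Counter


def summarize_concepts(concept_vectors):
    """Return (count, set of vector lengths, Counter of ID prefixes)."""
    dims = {len(vec) for vec in concept_vectors.values()}
    # Assumption: the text before the first underscore names the concept type.
    types = Counter(cid.split("_", 1)[0] for cid in concept_vectors)
    return len(concept_vectors), dims, types


# Toy stand-in for the real file; the IDs and values are made up.
toy = {"Gene_1": [0.1, 0.2], "Gene_2": [0.3, 0.4], "Other_1": [0.5, 0.6]}
count, dims, types = summarize_concepts(toy)
print(count, dims, types.most_common())
```

If `dims` contains more than one length, the vectors are not all the same size and the file is probably damaged.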


## 4. Get concept vectors

Now you can specify a concept ID and get its vector.

With the complete BioConceptVec model, use:

In [30]:
concept_vec = model['Gene_2997']
concept_vec

array([-0.01676855,  0.56551105,  1.00418   ,  0.13895853,  0.0487967 ,
        0.167735  , -0.07339028,  0.4984045 , -0.18709421, -0.07556584,
       -0.2731313 , -0.4116881 ,  0.089853  ,  0.48193076, -0.257556  ,
       -0.46385273,  0.21456282,  0.4653187 ,  0.54888815,  0.24953978,
        0.12102792,  0.16346483, -0.21817666,  0.17584416,  0.00181668,
        0.06124614, -0.665759  ,  0.27698943, -0.34381142, -0.29450157,
       -0.04999237, -0.42160794,  0.587909  ,  0.04599308, -0.52086115,
       -0.3301775 , -0.4439534 , -0.07623559,  1.1971241 , -0.05919534,
       -0.3013499 , -0.32670468, -0.69612557, -0.1875824 , -0.27888337,
        0.21057817,  0.1439918 , -0.5378452 ,  0.24323015, -0.19288874,
       -0.26549867,  0.57281816, -0.5705898 , -0.07732881,  0.39814204,
        0.5079281 ,  0.01936658,  0.78174835, -0.42671528, -0.10127745,
        0.7548943 ,  0.2102662 ,  0.10312578,  0.16521394,  0.10868917,
       -1.3835231 ,  0.6732168 , -0.22165756,  0.17197599, -0.63

The JSON file loads as a dictionary that maps each concept ID to a list of numbers, so the lookup uses the same key. Wrap the result in `np.array` to get a NumPy array:

In [34]:
np.array(concept_vectors['Gene_2997'])

array([-0.01676855,  0.56551105,  1.00417995,  0.13895853,  0.0487967 ,
        0.167735  , -0.07339028,  0.4984045 , -0.18709421, -0.07556584,
       -0.27313131, -0.41168809,  0.089853  ,  0.48193076, -0.25755599,
       -0.46385273,  0.21456282,  0.46531871,  0.54888815,  0.24953978,
        0.12102792,  0.16346483, -0.21817666,  0.17584416,  0.00181668,
        0.06124614, -0.66575903,  0.27698943, -0.34381142, -0.29450157,
       -0.04999237, -0.42160794,  0.58790898,  0.04599308, -0.52086115,
       -0.33017749, -0.44395339, -0.07623559,  1.19712412, -0.05919534,
       -0.30134991, -0.32670468, -0.69612557, -0.1875824 , -0.27888337,
        0.21057817,  0.1439918 , -0.53784519,  0.24323015, -0.19288874,
       -0.26549867,  0.57281816, -0.57058978, -0.07732881,  0.39814204,
        0.50792807,  0.01936658,  0.78174835, -0.42671528, -0.10127745,
        0.75489432,  0.2102662 ,  0.10312578,  0.16521394,  0.10868917,
       -1.38352311,  0.67321682, -0.22165756,  0.17197599, -0.63

Similarly, you can use the other concept IDs provided in the JSON file to look up more concept vectors.
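
Indexing with an ID that is not in the vocabulary raises a `KeyError`, both for the dictionary and for the gensim model. A small helper along these lines (the name `get_vector` is hypothetical) returns a default instead. It is shown here on a toy dictionary with made-up values.

```python
def get_vector(store, concept_id, default=None):
    """Return store[concept_id], or `default` when the ID is absent."""
    # Works for a plain dict and for gensim KeyedVectors:
    # both support the `in` operator and [] indexing.
    if concept_id in store:
        return store[concept_id]
    return default


toy = {"Gene_2997": [0.1, 0.2, 0.3]}  # made-up values, not the real vector
print(get_vector(toy, "Gene_2997"))
print(get_vector(toy, "Gene_0"))  # None: this ID is not in the toy dict
```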

## 5. Compute the similarity between concepts

Now we can use the concept vectors to find similar concepts.

First, let's create a function to calculate the cosine similarity:

In [36]:
def cosine(a, b):
    norm1 = np.linalg.norm(a)
    norm2 = np.linalg.norm(b)
    return np.dot(a, b) / (norm1 * norm2)
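
The function above divides by zero if either input is an all-zero vector. The variant below is a sketch (the name `cosine_safe` is hypothetical) that raises a clear error in that case. It also accepts plain Python lists, which is the form the vectors take when loaded from the JSON file.

```python
import numpy as np


def cosine_safe(a, b):
    """Cosine similarity of two vectors given as lists or arrays."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return float(np.dot(a, b) / denom)


print(cosine_safe([1.0, 0.0], [2.0, 0.0]))  # same direction
print(cosine_safe([1.0, 0.0], [0.0, 1.0]))  # orthogonal
```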
    

For example, which gene is more similar to the interleukin 10 gene (Gene_3586): the interleukin 4 gene (Gene_3565) or HUWE1 (Gene_10075)?

In [40]:
cosine(model['Gene_3586'], model['Gene_3565'])

0.9248491

In [41]:
cosine(model['Gene_3586'], model['Gene_10075'])

0.265184

The results show that the interleukin 4 gene is more similar to the interleukin 10 gene than HUWE1 is (0.92 versus 0.27). Indeed, the two interleukin genes share GO terms.

## 6. Find top K similar terms via BioConceptVec

You can also use the embedding to find the top K most similar terms:

In [43]:
model.most_similar(positive=['Gene_3586'], topn=3)

[('Gene_3458', 0.9307081699371338),
 ('Gene_3565', 0.9248490333557129),
 ('Gene_3606', 0.9062291383743286)]
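
`most_similar` is only available on the gensim model. If you loaded the JSON dictionary instead, a comparable ranking can be computed with NumPy. The sketch below is one way to do it, under stated assumptions: the function name `top_k_similar` and the optional `prefix` filter are hypothetical, and the toy IDs and values are made up. It ranks by cosine similarity, the same measure as the `cosine` function in section 5.

```python
import numpy as np


def top_k_similar(concept_vectors, query_id, k=3, prefix=None):
    """Return the k (concept_id, cosine) pairs most similar to query_id."""
    # Candidates: every ID except the query, optionally limited to one prefix.
    ids = [cid for cid in concept_vectors
           if cid != query_id and (prefix is None or cid.startswith(prefix))]
    if not ids:
        return []
    matrix = np.array([concept_vectors[cid] for cid in ids], dtype=float)
    query = np.asarray(concept_vectors[query_id], dtype=float)
    # Cosine similarity of the query against every candidate row at once.
    norms = np.linalg.norm(matrix, axis=1) * np.linalg.norm(query)
    sims = matrix @ query / norms
    order = np.argsort(-sims)[:k]  # indices of the k largest similarities
    return [(ids[i], float(sims[i])) for i in order]


# Toy data; the IDs and values are made up.
toy = {
    "Gene_A": [1.0, 0.0],
    "Gene_B": [0.9, 0.1],
    "Gene_C": [0.0, 1.0],
    "Other_D": [1.0, 0.05],
}
print(top_k_similar(toy, "Gene_A", k=2))
print(top_k_similar(toy, "Gene_A", k=2, prefix="Gene_"))
```

This version rebuilds the matrix on every call. For the full dictionary it would likely be better to build the ID list and matrix once and reuse them across queries.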