# BioSentVec Tutorial

This tutorial provides a fundamental introduction to our BioSentVec models. It illustrates (1) how to load the model, (2) an example function to preprocess sentences, (3) an example application that uses the model, and (4) further resources for using the model more broadly.

## 1. Prerequisites

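Please download the BioSentVec model and install the required Python libraries (sent2vec, NLTK and SciPy). The NLTK functions used below also need NLTK's stopwords corpus and Punkt tokenizer data, which can be fetched with nltk.download.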

In [25]:
import sent2vec
from nltk import word_tokenize
from nltk.corpus import stopwords
from string import punctuation
from scipy.spatial import distance

## 2. Load BioSentVec model

Please assign the location of the BioSentVec model to model_path. Loading the model may take a while the first time.

In [32]:
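model_path = YOUR_MODEL_LOCATION  # replace with the path to the downloaded BioSentVec model file
model = sent2vec.Sent2vecModel()
try:
    model.load_model(model_path)
except Exception as e:
    print(e)
else:
    # Report success only when load_model did not raise
    print('model successfully loaded')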

model successfully loaded
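
A quick check that the path points to an existing file can catch typos before the (potentially slow) load starts. The following is a minimal sketch using only the standard library; the helper name `resolve_model_path` and the file name in the usage comment are hypothetical and not part of sent2vec.

```python
from pathlib import Path


def resolve_model_path(path_str):
    """Return an absolute Path to the model file, or raise FileNotFoundError.

    Hypothetical helper for illustration; it is not part of sent2vec.
    """
    path = Path(path_str).expanduser().resolve()
    if not path.is_file():
        raise FileNotFoundError(f"No model file found at: {path}")
    return path


# Example usage (the file name below is a placeholder, not the real model name):
# model_path = str(resolve_model_path("~/models/your_biosentvec_model.bin"))
# model.load_model(model_path)
```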


## 3. Preprocess sentences

There is no one-size-fits-all solution for preprocessing sentences. We demonstrate a representative code example below. It is consistent with the preprocessing we applied when training the BioSentVec models.

In [15]:
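# English stop words from NLTK, stored in a set for fast membership tests
stop_words = set(stopwords.words('english'))

def preprocess_sentence(text):
    # Pad slashes, periods and apostrophes with spaces so they become separate tokens
    text = text.replace('/', ' / ')
    text = text.replace('.-', ' .- ')
    text = text.replace('.', ' . ')
    text = text.replace('\'', ' \' ')
    text = text.lower()

    # Tokenize, then drop punctuation tokens and stop words
    tokens = [token for token in word_tokenize(text) if token not in punctuation and token not in stop_words]

    return ' '.join(tokens)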

An example of using the preprocess_sentence function: 

In [18]:
sentence = preprocess_sentence('Breast cancers with HER2 amplification have a higher risk of CNS metastasis and poorer prognosis.')
print(sentence)

breast cancers her2 amplification higher risk cns metastasis poorer prognosis
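
To see what the four `replace` calls contribute before tokenization, the sketch below applies only those steps plus lowercasing and then splits on whitespace. Note the assumptions: `str.split` stands in for NLTK's `word_tokenize` so the snippet runs without NLTK data, and punctuation and stop-word filtering are omitted, so the result is not identical to the output of `preprocess_sentence`. The helper name `pad_special_characters` and the input string are made up for illustration.

```python
def pad_special_characters(text):
    """Apply the same character padding and lowercasing as preprocess_sentence."""
    text = text.replace('/', ' / ')
    text = text.replace('.-', ' .- ')
    text = text.replace('.', ' . ')
    text = text.replace('\'', ' \' ')
    return text.lower()


padded = pad_special_characters("HER2/neu status wasn't reported.")
# Whitespace split is used here only as a stand-in for word_tokenize (assumption).
print(padded.split())
# ['her2', '/', 'neu', 'status', 'wasn', "'", 't', 'reported', '.']
```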


## 4. Retrieve a sentence vector

Once a sentence is preprocessed, we can pass it to the BioSentVec model to retrieve a vector representation of the sentence.

In [20]:
sentence_vector = model.embed_sentence(sentence)
print(sentence_vector)

[[ 0.27253592  0.04016513 -0.13868049  0.06607066  0.03410426  0.03702081
   0.04780459  0.318374    0.1389506   0.14894584  0.03802885  0.16076139
   0.27367333  0.28947747 -0.3635127   0.1523829   0.00113982  0.15947492
  -0.00115095 -0.3911827   0.06040372 -0.30060792  0.5700456  -0.3073153
   0.05641874 -0.38538572  0.03242918 -0.01758919 -0.53824794 -0.2036874
   0.09088504  0.42208442  0.01777515  0.26457042  0.00444555 -0.4244185
   0.08552625 -0.01220523 -0.52954006 -0.19729511  0.3146897   0.39812556
  -0.73728865 -0.15572241  0.12493155 -0.189124    0.30150056 -0.13335498
  -0.22929646  0.1923776  -0.25276372  0.48184827 -0.11678692  0.074292
  -0.3565283   0.06902904 -0.16303737 -0.1516651  -0.16457589  0.2640424
  -0.2330729   0.03231101  0.3361209   0.35289383 -0.23463576 -0.29648
  -0.3083266   0.39252853 -0.24566592 -0.2444962   0.20645703 -0.04719147
   0.10580424  0.00649089 -0.2572806  -0.333023   -0.03018534 -0.042082
  -0.03446042  0.1267659   0.37817308 -0.38865507

Note that you can also use embed_sentences to retrieve vector representations of multiple sentences.

The shape of the returned array depends on the dimension parameter of the model. In this case the dimension is set to 700, so a single sentence yields an array of shape (1, 700):

In [21]:
print(sentence_vector.shape)

(1, 700)
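
For several sentences, `embed_sentences` (mentioned above) can be combined with the preprocessing step. The sketch below assumes that `embed_sentences` accepts a list of preprocessed strings and returns one row per sentence. The wrapper `embed_batch` is a hypothetical helper, and its `model` and `preprocess` arguments are expected to be the loaded model and the `preprocess_sentence` function from the earlier sections.

```python
def embed_batch(model, sentences, preprocess):
    """Preprocess each raw sentence, then embed them all in one call.

    Hypothetical convenience wrapper; `model` is assumed to expose
    embed_sentences(list_of_str), as the sent2vec model does.
    """
    cleaned = [preprocess(s) for s in sentences]
    return model.embed_sentences(cleaned)


# Example usage with the objects defined earlier in this tutorial:
# vectors = embed_batch(model, ['First sentence.', 'Second sentence.'], preprocess_sentence)
# print(vectors.shape)  # one row per input sentence
```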


## 5. Compute sentence similarity

In this section, we demonstrate how to compute the similarity between a pair of sentences using the BioSentVec model. We first use the code examples above to obtain vector representations of the sentences, and then compute the cosine similarity between the two vectors.

In [27]:
sentence_vector1 = model.embed_sentence(preprocess_sentence('Breast cancers with HER2 amplification have a higher risk of CNS metastasis and poorer prognosis.'))
sentence_vector2 = model.embed_sentence(preprocess_sentence('Breast cancers with HER2 amplification are more aggressive, have a higher risk of CNS metastasis, and poorer prognosis.'))

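# embed_sentence returns an array of shape (1, 700); take row 0 because distance.cosine expects 1-D vectors
cosine_sim = 1 - distance.cosine(sentence_vector1[0], sentence_vector2[0])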
print('cosine similarity:', cosine_sim)

cosine similarity: 0.9813870787620544
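
For reference, `distance.cosine` returns the cosine distance, so subtracting it from 1 gives the cosine similarity: the dot product of the two vectors divided by the product of their Euclidean norms. A minimal pure-Python sketch of that formula follows. The function name `cosine_similarity` and the toy 2-dimensional vectors are illustrative only and do not come from the model.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length, non-zero 1-D sequences."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy vectors (not model output): the angle between them is 45 degrees.
print(round(cosine_similarity([1.0, 0.0], [1.0, 1.0]), 4))  # 0.7071
```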


Here is another example, for a pair of sentences that is less similar.

In [29]:
sentence_vector3 = model.embed_sentence(preprocess_sentence('Furthermore, increased CREB expression in breast tumors is associated with poor prognosis, shorter survival and higher risk of metastasis.'))
# Take row 0 of each (1, 700) array because distance.cosine expects 1-D vectors
cosine_sim = 1 - distance.cosine(sentence_vector1[0], sentence_vector3[0])
print('cosine similarity:', cosine_sim)

cosine similarity: 0.7300089001655579
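
The same pairwise comparison extends to ranking several candidate sentences against one query. The sketch below is one possible way to do this, using toy 3-dimensional vectors in place of real BioSentVec embeddings. `rank_by_similarity`, `_cosine` and the candidate labels are hypothetical names introduced for illustration.

```python
import math


def _cosine(u, v):
    """Cosine similarity of two equal-length, non-zero 1-D sequences."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_by_similarity(query_vec, candidates):
    """Return (label, similarity) pairs sorted from most to least similar.

    `candidates` maps a label to its vector.
    """
    scored = [(label, _cosine(query_vec, vec)) for label, vec in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy vectors standing in for sentence embeddings (not model output).
query = [1.0, 0.0, 0.0]
candidates = {
    'same direction': [2.0, 0.0, 0.0],
    'partly aligned': [1.0, 1.0, 0.0],
    'orthogonal': [0.0, 1.0, 0.0],
}
for label, score in rank_by_similarity(query, candidates):
    print(label, round(score, 4))
```

With real embeddings, each vector would be a 1-D row such as `model.embed_sentence(...)[0]`.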


## 6. More resources

The above example demonstrates an unsupervised way to use the BioSentVec model. In addition, we summarize a few useful resources:

#### (1) The Sent2vec homepage (https://github.com/epfml/sent2vec) has a few pre-trained sentence embeddings from general English corpora.
#### (2) You can also develop deep learning models to learn sentence similarity in a supervised manner.
#### (3) You can also use BioSentVec in other applications, such as multi-label classification.

## Reference

If you use any of our pre-trained models in your application, please cite the following paper:

Chen Q, Peng Y, Lu Z. BioSentVec: creating sentence embeddings for biomedical texts. 2018. arXiv:1810.09302.