# BioSentVec Tutorial

This tutorial provides a basic introduction to our BioSentVec models. It illustrates (1) how to load the model, (2) an example function to preprocess sentences, (3) an example application that uses the model, and (4) further resources for using the model more broadly.

## 1. Prerequisites

Please download the BioSentVec model and install the required Python libraries.

In [1]:
import pandas as pd
import sent2vec
import numpy as np
import nltk
from nltk import word_tokenize
from nltk.corpus import stopwords
from string import punctuation
from scipy.spatial import distance

## 2. Load BioSentVec model

Set model_path to the location of the BioSentVec model file. Loading the model for the first time may take a while. The pre-trained model can be downloaded here: https://ftp.ncbi.nlm.nih.gov/pub/lu/Suppl/BioSentVec/BioSentVec_PubMed_MIMICIII-bigram_d700.bin. For information on installing sent2vec, refer to the following README: https://github.com/epfml/sent2vec/tree/master


In [2]:
nltk.download('stopwords')
nltk.download('punkt')

[nltk_data] Downloading package stopwords to
[nltk_data]     /data/home/bty381/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     /data/home/bty381/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [None]:
model_path = 'BioSentVec_PubMed_MIMICIII-bigram_d700.bin'
model = sent2vec.Sent2vecModel()
try:
    model.load_model(model_path)
except Exception as e:
    print(e)
else:
    # Report success only if loading did not raise an exception
    print('model successfully loaded')

## 3. Preprocess sentences

There is no one-size-fits-all solution for preprocessing sentences. We show a representative code example below. It is consistent with the preprocessing approach we used when training the BioSentVec models.

In [None]:
stop_words = set(stopwords.words('english'))
def preprocess_sentence(text):
    text = text.replace('/', ' / ')
    text = text.replace('.-', ' .- ')
    text = text.replace('.', ' . ')
    text = text.replace('\'', ' \' ')
    text = text.replace(';', ' ; ')
    text = text.lower()

    tokens = [token for token in word_tokenize(text) if token not in punctuation and token not in stop_words]

    return ' '.join(tokens)

An example of using the preprocess_sentence function: 

In [None]:
sentence = preprocess_sentence('Breast cancers with HER2 amplification have a higher risk of CNS metastasis and poorer prognosis.')
print(sentence)
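
To see what the character-replacement steps contribute on their own, the following sketch reproduces them with plain string methods. It is an illustration under assumptions, not the training pipeline: `pad_punctuation` is a hypothetical helper name, whitespace splitting stands in for `word_tokenize`, and stop-word removal is left out so that the block runs without NLTK data.

```python
from string import punctuation


def pad_punctuation(text):
    """Apply the same replacements, in the same order, as preprocess_sentence."""
    for mark in ('/', '.-', '.', "'", ';'):
        text = text.replace(mark, ' ' + mark + ' ')
    return text.lower()


# Hypothetical input chosen to exercise several of the replacements.
padded = pad_punctuation("IL-2/IL-15 signalling; patient's response.")

# Assumption: str.split() approximates word_tokenize for this simple input.
tokens = padded.split()

# Same punctuation filter as preprocess_sentence (stop words are not removed here).
kept = [token for token in tokens if token not in punctuation]

print(tokens)
print(kept)
```

Padding the marks with spaces turns them into separate tokens, which the punctuation filter then drops. Hyphens are not padded, so a term such as `il-2` stays in one piece.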

## 4. Retrieve a sentence vector

Once a sentence is preprocessed, we can pass it to the BioSentVec model to retrieve a vector representation of the sentence.

In [None]:
sentence_vector = model.embed_sentence(sentence)
print(sentence_vector)

Note that you can also use embed_sentences to retrieve vector representations of multiple sentences.

The shape of the vector representation depends on the dimension the model was trained with. This model uses 700 dimensions:

In [None]:
print(sentence_vector.shape)
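
The `np.squeeze(..., axis=0)` calls used later in this tutorial suggest that `embed_sentence` returns a 2-D array with one row per sentence. The sketch below uses a zero-filled stand-in array (an assumption; no model is loaded) to show how that leading axis is removed to obtain the 1-D vector that functions such as `distance.cosine` expect.

```python
import numpy as np

# Stand-in for the result of model.embed_sentence(sentence); assumed shape (1, 700).
stand_in_vector = np.zeros((1, 700), dtype=np.float32)

# Drop the leading axis of length 1 to get a plain 1-D vector.
flat_vector = np.squeeze(stand_in_vector, axis=0)

print(stand_in_vector.shape)
print(flat_vector.shape)
```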

## 5. Compute sentence similarity

In this section, we demonstrate how to compute the similarity between sentences using the BioSentVec model. We first use the code examples above to obtain vector representations of the sentences, then compute the cosine similarity between each pair.

In [None]:
sentence_vector1 = np.squeeze(model.embed_sentence(preprocess_sentence('In vitro anticancer activity against 2 NCI SCLC cell lines; inactive')), axis=0)
sentence_vector2 = np.squeeze(model.embed_sentence(preprocess_sentence('In vitro anticancer activity against 11 NCI NSCLC cell lines; inactive')), axis=0)
sentence_vector3 = np.squeeze(model.embed_sentence(preprocess_sentence('In vitro anticancer activity against 6 NCI ovarian cancer cell lines; inactive')), axis=0)

cosine_sim = 1 - distance.cosine(sentence_vector1, sentence_vector2)
print('cosine similarity:', cosine_sim)
cosine_sim = 1 - distance.cosine(sentence_vector1, sentence_vector3)
print('cosine similarity:', cosine_sim)
cosine_sim = 1 - distance.cosine(sentence_vector3, sentence_vector2)
print('cosine similarity:', cosine_sim)
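
For reference, `1 - distance.cosine(u, v)` is the cosine similarity: the dot product of the two vectors divided by the product of their lengths. The sketch below computes it directly with NumPy and also builds the full pairwise matrix in one step. The toy vectors and the helper names `cosine_similarity` and `pairwise_cosine` are illustrative assumptions, not part of BioSentVec or sent2vec.

```python
import numpy as np


def cosine_similarity(u, v):
    """Cosine similarity of two vectors; accepts shapes (n,) or (1, n)."""
    u = np.asarray(u, dtype=float).ravel()
    v = np.asarray(v, dtype=float).ravel()
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


def pairwise_cosine(vectors):
    """Return the matrix of cosine similarities between all rows of `vectors`."""
    matrix = np.asarray(vectors, dtype=float)
    # Scale every row to unit length; the matrix product then gives the cosines.
    unit_rows = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    return unit_rows @ unit_rows.T


# Toy 3-dimensional vectors standing in for sentence embeddings.
toy_vectors = np.array([
    [1.0, 0.0, 0.0],
    [1.0, 1.0, 0.0],
    [0.0, 0.0, 2.0],
])

similarities = pairwise_cosine(toy_vectors)
print(np.round(similarities, 3))
```

Entry `[i, j]` of the matrix is the similarity between rows `i` and `j`. Both helpers assume that no vector is all zeros.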

### Computing embeddings for assay descriptions

In [41]:
df = pd.read_csv('processed_data.csv')
# Define a function to compute embeddings
def compute_embeddings(row):
    text = row['description']
    embedding = np.squeeze(model.embed_sentence(preprocess_sentence(text)), axis=0)
    return embedding

# Apply the function to each row and create a new 'embeddings' column
df['embeddings'] = df.apply(compute_embeddings, axis=1)
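
Each entry of the `embeddings` column is a NumPy array. A CSV file stores such entries as text, so they do not come back as numeric arrays when the file is read again. If the vectors need to be reloaded as numbers, one option is to stack them into a single matrix and save it in NumPy's binary format. The sketch below assumes toy 4-dimensional vectors in place of real embeddings and writes to an in-memory buffer; in practice a file path such as the hypothetical `'assay_embeddings.npy'` would take its place.

```python
import io

import numpy as np

# Stand-in for df['embeddings']: three toy 4-dimensional vectors.
toy_embeddings = [np.arange(4, dtype=np.float32) + i for i in range(3)]

# Stack the per-row vectors into one (rows, dimensions) matrix.
embedding_matrix = np.vstack(toy_embeddings)

# Save and reload; a BytesIO buffer is used here instead of a file on disk.
buffer = io.BytesIO()
np.save(buffer, embedding_matrix)
buffer.seek(0)
restored = np.load(buffer)

print(restored.shape, restored.dtype)
```

Row `i` of the matrix corresponds to row `i` of the DataFrame, so the two can be matched by position.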

In [47]:
df.to_csv('assays_with_embeddings.csv')
# Reload the saved file to check its contents
test = pd.read_csv('assays_with_embeddings.csv')
print(test.head())

Here is another example for a pair that is relatively less similar.

In [23]:
# Squeeze both vectors to 1-D, as in the examples above; distance.cosine rejects 2-D input
sentence_vector3 = np.squeeze(model.embed_sentence(preprocess_sentence('Furthermore, increased CREB expression in breast tumors is associated with poor prognosis, shorter survival and higher risk of metastasis.')), axis=0)
cosine_sim = 1 - distance.cosine(np.squeeze(sentence_vector, axis=0), sentence_vector3)
print('cosine similarity:', cosine_sim)

## 6. More resources

The above example demonstrates an unsupervised way to use the BioSentVec model. In addition, we summarize a few useful resources:

#### (1) The Sent2vec homepage (https://github.com/epfml/sent2vec) has a few pre-trained sentence embeddings from general English corpora.
#### (2) You can also develop deep learning models to learn sentence similarity in a supervised manner.
#### (3) You can also use the BioSentVec in other applications, such as multi-label classification.

## Reference

If you use any of our pre-trained models in your application, please cite the following paper:

Chen Q, Peng Y, Lu Z. BioSentVec: creating sentence embeddings for biomedical texts. 2018. arXiv:1810.09302.