# AI-Driven Clinical Decision Support: Enhancing Disease Diagnosis Exploiting Patients Similarity

This notebook computes similarity scores for sentence pairs and term pairs using the BioSentVec and Universal Sentence Encoder models, with Levenshtein distance as a string-based baseline.

## 1. Prerequisites

Download the pre-trained BioSentVec model and install the required Python libraries.

In [23]:
import sent2vec
from nltk import word_tokenize
from nltk.corpus import stopwords
from string import punctuation
from scipy.spatial import distance

Install the Python libraries needed to access the Universal Sentence Encoder model.

In [24]:
from absl import logging
import tensorflow as tf
import tensorflow_hub as hub
import numpy as np

## 2. Load BioSentVec model

In [2]:
model_path = '../test_data/BioSentVec_PubMed_MIMICIII-bigram_d700.bin'
model = sent2vec.Sent2vecModel()
try:
    model.load_model(model_path)
except Exception as e:
    print(e)
else:
    # Report success only when load_model did not raise
    print('model successfully loaded')

model successfully loaded


## 3. Load Universal Sentence Encoder model

Note that this is the standard version of the Universal Sentence Encoder, which only processes English text. Other versions support more languages and are suited to different forms of textual input.

In [25]:
module_url = "https://tfhub.dev/google/universal-sentence-encoder/4"
universal_sentence_encoder = hub.load(module_url)
logging.set_verbosity(logging.ERROR)

## 4. Preprocess sentences

In [27]:
import nltk
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))
def preprocess_sentence(text):
    text = text.replace('/', ' / ')
    text = text.replace('.-', ' .- ')
    text = text.replace('.', ' . ')
    text = text.replace('\'', ' \' ')
    text = text.lower()

    tokens = [token for token in word_tokenize(text) if token not in punctuation and token not in stop_words]

    return ' '.join(tokens)

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\akhil\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


An example of using the preprocess_sentence function: 

In [28]:
import nltk
nltk.download('punkt')
sentence = preprocess_sentence('Breast cancers with HER2 amplification have a higher risk of CNS metastasis and poorer prognosis.')
print(sentence)

breast cancers her2 amplification higher risk cns metastasis poorer prognosis


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\akhil\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


## 5. Retrieve a sentence vector

#### 1. BioSentVec

Once a sentence is preprocessed, we can pass it to the BioSentVec model to retrieve a vector representation of the sentence.

In [5]:
sentence_vector = model.embed_sentence(sentence)
print(sentence_vector)

[[ 0.27253592  0.04016513 -0.13868049  0.06607066  0.03410426  0.03702081
   0.04780459  0.318374    0.1389506   0.14894584  0.03802885  0.16076139
   0.27367333  0.28947747 -0.3635127   0.1523829   0.00113982  0.15947492
  -0.00115095 -0.3911827   0.06040372 -0.30060792  0.5700456  -0.3073153
   0.05641874 -0.38538572  0.03242918 -0.01758919 -0.53824794 -0.2036874
   0.09088504  0.42208442  0.01777515  0.26457042  0.00444555 -0.4244185
   0.08552625 -0.01220523 -0.52954006 -0.19729511  0.3146897   0.39812556
  -0.73728865 -0.15572241  0.12493155 -0.189124    0.30150056 -0.13335498
  -0.22929646  0.1923776  -0.25276372  0.48184827 -0.11678692  0.074292
  -0.3565283   0.06902904 -0.16303737 -0.1516651  -0.16457589  0.2640424
  -0.2330729   0.03231101  0.3361209   0.35289383 -0.23463576 -0.29648
  -0.3083266   0.39252853 -0.24566592 -0.2444962   0.20645703 -0.04719147
   0.10580424  0.00649089 -0.2572806  -0.333023   -0.03018534 -0.042082
  -0.03446042  0.1267659   0.37817308 -0.38865507

Note that you can also use embed_sentences to retrieve vector representations of multiple sentences.

The shape of the vector representation depends on the model's dimension parameter. The pre-trained model loaded here produces 700-dimensional vectors:

In [6]:
print(sentence_vector.shape)

(1, 700)


#### 2. Universal Sentence Encoder

Once a sentence is preprocessed, we can pass it to the Universal Sentence Encoder model to retrieve a vector representation of the sentence.

In [29]:
sentence_vector = universal_sentence_encoder([sentence])
print(sentence_vector)

tf.Tensor(
[[-0.01706054 -0.02982263 -0.06450873 -0.00945025 -0.03946786 -0.077258
  -0.04527489 -0.06925316 -0.00476587  0.05182612  0.08069605 -0.05177524
  -0.04592242 -0.04435806  0.01292375 -0.04330221 -0.08106487 -0.03709086
   0.06627572 -0.03122658  0.01539265  0.07002871  0.02392466 -0.01713471
   0.0310583  -0.03737702  0.027232   -0.02219061 -0.0124025   0.07148183
   0.05177425  0.08108076 -0.00427653 -0.07375586 -0.05576143  0.01741701
  -0.04468859  0.06967676 -0.04987274 -0.05859317  0.03158952 -0.02929872
   0.0215015  -0.02629163  0.04192941 -0.04903167 -0.05458231 -0.04037523
  -0.06134297  0.06647678 -0.04487872 -0.04345295 -0.06080464 -0.04348946
  -0.05720161 -0.0214076  -0.03062622  0.01864022  0.0283277   0.01318153
  -0.05516872 -0.06180546  0.03157061 -0.03427697 -0.06046344 -0.01410388
   0.02567874  0.03303296  0.06351852  0.0652984   0.02503775  0.06748096
   0.00462156  0.04772187  0.0510456   0.02751398 -0.01434254  0.07380927
   0.00462101 -0.05693014  0.

The Universal Sentence Encoder returns a 512-dimensional vector for each input sentence:

In [30]:
print(sentence_vector.shape)

(1, 512)


## 6. Compute sentence similarity

#### 1. BioSentVec

Compute the sentence similarity between a sentence pair using the BioSentVec model. We first use the code examples above to get vector representations of the sentences, then compute the cosine similarity between the pair.

In [7]:
sentence_vector1 = model.embed_sentence(preprocess_sentence('Breast cancers with HER2 amplification have a higher risk of CNS metastasis and poorer prognosis.'))
sentence_vector2 = model.embed_sentence(preprocess_sentence('Breast cancers with HER2 amplification are more aggressive, have a higher risk of CNS metastasis, and poorer prognosis.'))

# embed_sentence returns shape (1, 700); index [0] to pass 1-D vectors to SciPy
cosine_sim = 1 - distance.cosine(sentence_vector1[0], sentence_vector2[0])
print('cosine similarity:', cosine_sim)

cosine similarity: 0.9813870787620544
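
To make the quantity above concrete, here is a minimal sketch of cosine similarity written out by hand, without SciPy. The function name `cosine_similarity` and the toy vectors are made up for illustration and are not model output; returning NaN for a zero vector is just one possible convention.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length 1-D sequences of numbers."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # The cosine is undefined for a zero vector; NaN is an assumed convention.
        return float("nan")
    return dot / (norm_u * norm_v)

# Toy 3-dimensional vectors (illustrative only).
a = [1.0, 2.0, 3.0]
b = [2.0, 4.0, 6.0]   # same direction as a, so the similarity is close to 1
c = [3.0, 0.0, -1.0]  # orthogonal to a, so the similarity is 0

print(cosine_similarity(a, b))
print(cosine_similarity(a, c))
```

The value depends only on the angle between the vectors, not on their lengths, which is why scaling `a` by two leaves the similarity at (approximately) 1.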


Here is another example, this time for a less similar pair.

In [8]:
sentence_vector3 = model.embed_sentence(preprocess_sentence('Furthermore, increased CREB expression in breast tumors is associated with poor prognosis, shorter survival and higher risk of metastasis.'))
cosine_sim = 1 - distance.cosine(sentence_vector1[0], sentence_vector3[0])  # 1-D inputs
print('cosine similarity:', cosine_sim)

cosine similarity: 0.7300089001655579


#### 2. Universal Sentence Encoder

Compute the sentence similarity between the same sentence pairs using the Universal Sentence Encoder model. Apply the model to each sentence, to get vector representations of the individual sentences, then compute the inner product of the two vectors to determine the similarity.

In [31]:
sentence_vector1 = universal_sentence_encoder([preprocess_sentence('Breast cancers with HER2 amplification have a higher risk of CNS metastasis and poorer prognosis.')])
sentence_vector2 = universal_sentence_encoder([preprocess_sentence('Breast cancers with HER2 amplification are more aggressive, have a higher risk of CNS metastasis, and poorer prognosis.')])
sim = np.inner(sentence_vector1, sentence_vector2)[0][0]
print('Similarity:', sim)

Similarity: 0.9569721
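
The inner product used above equals the cosine similarity only when both vectors have unit length. The sketch below illustrates that point with toy vectors; it assumes nothing about the encoder itself, and whether a given Universal Sentence Encoder version returns (approximately) unit-length vectors is worth checking for the model you load. The helper names `inner` and `l2_normalize` are hypothetical.

```python
import math

def inner(u, v):
    """Plain inner (dot) product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))

def l2_normalize(u):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(inner(u, u))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [a / norm for a in u]

u = [3.0, 4.0]  # length 5
v = [6.0, 8.0]  # same direction as u, length 10

raw = inner(u, v)                                     # 50.0: grows with vector length
normalized = inner(l2_normalize(u), l2_normalize(v))  # close to 1: depends on direction only

print(raw)
print(normalized)
```

If the embeddings are not normalized, normalizing them first (or using a cosine similarity directly) keeps the scores comparable with the BioSentVec results.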


Here is the less similar pair:

In [32]:
sentence_vector3 = universal_sentence_encoder([preprocess_sentence('Furthermore, increased CREB expression in breast tumors is associated with poor prognosis, shorter survival and higher risk of metastasis.')])
sim = np.inner(sentence_vector1, sentence_vector3)[0][0]
print('Similarity:', sim)

Similarity: 0.5786929


## 7. Test Dataset 

### Initialization

In [9]:
from functools import lru_cache

# Levenshtein distance
def lev_dist(a, b):    
    
    @lru_cache(None)  # for memoization
    def min_dist(s1, s2):

        if s1 == len(a) or s2 == len(b):
            return len(a) - s1 + len(b) - s2

        # no change required
        if a[s1] == b[s2]:
            return min_dist(s1 + 1, s2 + 1)

        return 1 + min(
            min_dist(s1, s2 + 1),      # insert character
            min_dist(s1 + 1, s2),      # delete character
            min_dist(s1 + 1, s2 + 1),  # replace character
        )

    return min_dist(0, 0)
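
The recursive version above can recurse as deep as `len(a) + len(b)`, so very long inputs may hit Python's recursion limit. As an alternative, here is a sketch of the same edit distance computed iteratively with two rows of the dynamic-programming table. The name `lev_dist_iterative` is made up for this example and is not part of the original notebook.

```python
def lev_dist_iterative(a, b):
    """Levenshtein distance using two rows of the DP table (no recursion)."""
    # Row 0: distance from the empty prefix of a to each prefix b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty string
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,                      # delete char_a
                current[j - 1] + 1,                   # insert char_b
                previous[j - 1] + substitution_cost,  # replace, or keep if equal
            ))
        previous = current
    return previous[-1]

print(lev_dist_iterative("kitten", "sitting"))  # 3
print(lev_dist_iterative("", "abc"))            # 3
```

Memory use is proportional to `len(b)` rather than to the product of both lengths, which may help on long sentences.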

### 1. BIOSSES Test Dataset 

#### 1. BioSentVec

In [10]:
import docx
import numpy as np

doc = docx.Document('../test_data/BIOSSES-Dataset/Annotation-Pairs.docx')
tables = doc.tables
data = []

keys = None
for i, row in enumerate(tables[0].rows):
    text = (cell.text for cell in row.cells)

    if i == 0:
        keys = tuple(text)
        continue

    row_data = dict(zip(keys, text))
    data.append(row_data)
cosine_similarity_list = []
for sentMap in data:
    sentence_vector1 = model.embed_sentence(preprocess_sentence(sentMap['Sentence 1']))
    sentence_vector2 = model.embed_sentence(preprocess_sentence(sentMap['Sentence 2']))
    cosine_sim = 1 - distance.cosine(sentence_vector1[0], sentence_vector2[0])  # 1-D inputs
    cosine_similarity_list.append(cosine_sim)
cosine_similarity_score = np.mean(cosine_similarity_list)
print('BioSentVec similarity score:', cosine_similarity_score)


BioSentVec similarity score: 0.557557116150856


#### 2. Levenshtein Distance

In [11]:
import docx
import numpy as np

doc = docx.Document('../test_data/BIOSSES-Dataset/Annotation-Pairs.docx')
tables = doc.tables
data = []

keys = None
for i, row in enumerate(tables[0].rows):
    text = (cell.text for cell in row.cells)

    if i == 0:
        keys = tuple(text)
        continue

    row_data = dict(zip(keys, text))
    data.append(row_data)
lev_dist_similarity_list = []
for sentMap in data:
    sentence_vector1 = preprocess_sentence(sentMap['Sentence 1'])
    sentence_vector2 = preprocess_sentence(sentMap['Sentence 2'])
    # Levenshtein distance
    lev_sim = lev_dist(sentence_vector1, sentence_vector2)
    lev_dist_similarity_list.append(lev_sim)
min_distance = np.min(lev_dist_similarity_list)

dist_list = []
for dist in lev_dist_similarity_list:
    dist_list.append(min_distance / dist)
lev_similarity_score = np.mean(dist_list)
print('Levenshtein similarity score:', lev_similarity_score)


Levenshtein similarity score: 0.31274520377736964
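
The Levenshtein score above turns each distance `d` into `min_distance / d` and averages the results. The sketch below works through that normalization on made-up distances. The function name `min_ratio_scores` is hypothetical, and skipping zero distances is an assumption borrowed from the UMNSRS cell in this notebook; the BIOSSES and MayoSRS cells do not filter them.

```python
def min_ratio_scores(distances):
    """Map each distance d to min(distances) / d, ignoring zero distances."""
    positive = [d for d in distances if d != 0]  # assumption: skip identical pairs
    if not positive:
        raise ValueError("need at least one non-zero distance")
    smallest = min(positive)
    return [smallest / d for d in positive]

toy_distances = [2, 4, 8]  # made-up edit distances, not dataset values
scores = min_ratio_scores(toy_distances)
mean_score = sum(scores) / len(scores)

print(scores)      # [1.0, 0.5, 0.25]
print(mean_score)
```

Because every score is relative to the closest pair in the same dataset, the closest pair always scores 1.0, and the averages are not directly comparable across datasets.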


#### 3. Universal Sentence Encoder

In [33]:
import docx
import numpy as np

doc = docx.Document('../test_data/BIOSSES-Dataset/Annotation-Pairs.docx')
tables = doc.tables
data = []

keys = None
for i, row in enumerate(tables[0].rows):
    text = (cell.text for cell in row.cells)

    if i == 0:
        keys = tuple(text)
        continue

    row_data = dict(zip(keys, text))
    data.append(row_data)
sent_encoder_similarity_list = []
for sentMap in data:
    sentence_vector1 = universal_sentence_encoder([preprocess_sentence(sentMap['Sentence 1'])])
    sentence_vector2 = universal_sentence_encoder([preprocess_sentence(sentMap['Sentence 2'])])
    sim = np.inner(sentence_vector1, sentence_vector2)[0][0]
    sent_encoder_similarity_list.append(sim)
similarity_score = np.mean(sent_encoder_similarity_list)
print('Universal Sentence Encoder similarity score:', similarity_score)

Universal Sentence Encoder similarity score: 0.36970225


### 2. MayoSRS Test Dataset 

#### 1. BioSentVec

In [12]:
import csv

file = open('../test_data/MayoSRS.csv')
csvreader = csv.reader(file)

cosine_similarity_list = []

for row in csvreader:
    sentence_vector1 = model.embed_sentence(preprocess_sentence(row[3]))
    sentence_vector2 = model.embed_sentence(preprocess_sentence(row[4]))
    cosine_sim = 1 - distance.cosine(sentence_vector1[0], sentence_vector2[0])  # 1-D inputs
    cosine_similarity_list.append(cosine_sim)
file.close()
cosine_similarity_score = np.mean(cosine_similarity_list)
print('BioSentVec similarity score:', cosine_similarity_score)

BioSentVec similarity score: 0.2514136122977909



#### 2. Levenshtein Distance

In [13]:
import csv

file = open('../test_data/MayoSRS.csv')
csvreader = csv.reader(file)

lev_dist_similarity_list = []
for row in csvreader:
    sentence_vector1 = preprocess_sentence(row[3])
    sentence_vector2 = preprocess_sentence(row[4])
    lev_sim = lev_dist(sentence_vector1, sentence_vector2)
    lev_dist_similarity_list.append(lev_sim)
min_distance = np.min(lev_dist_similarity_list)
file.close()

dist_list = []
for dist in lev_dist_similarity_list:
    dist_list.append(min_distance / dist)
lev_similarity_score = np.mean(dist_list)
print('Levenshtein similarity score:', lev_similarity_score)

Levenshtein similarity score: 0.10480677938548977


#### 3. Universal Sentence Encoder

In [34]:
import csv

file = open('../test_data/MayoSRS.csv')
csvreader = csv.reader(file)
sent_encoder_similarity_list = []

for row in csvreader:
    sentence_vector1 = universal_sentence_encoder([preprocess_sentence(row[3])])
    sentence_vector2 = universal_sentence_encoder([preprocess_sentence(row[4])])
    sim = np.inner(sentence_vector1, sentence_vector2)[0][0]
    sent_encoder_similarity_list.append(sim)
file.close()
similarity_score = np.mean(sent_encoder_similarity_list)
print('Universal Sentence Encoder similarity score:', similarity_score)

Universal Sentence Encoder similarity score: 0.32283983
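
The MayoSRS cells above all open the CSV file, read two columns per row, and close the file by hand. A possible way to factor that out is sketched below with a context manager, so the file is closed even if an error occurs. The helper name `read_sentence_pairs`, the UTF-8 encoding, and the demo rows are assumptions for illustration; the real MayoSRS layout is not reproduced here.

```python
import csv
import os
import tempfile

def read_sentence_pairs(path, col_a, col_b):
    """Yield (text_a, text_b) from the given columns of each CSV row."""
    # newline='' is the setting the csv module documentation recommends.
    with open(path, newline='', encoding='utf-8') as handle:
        for row in csv.reader(handle):
            yield row[col_a], row[col_b]

# Demonstration on a throwaway file with a made-up layout.
with tempfile.NamedTemporaryFile('w', suffix='.csv', delete=False,
                                 newline='', encoding='utf-8') as tmp:
    csv.writer(tmp).writerows([
        ['1', 'x', 'y', 'renal failure', 'kidney failure'],
        ['2', 'x', 'y', 'heart', 'myocardium'],
    ])

pairs = list(read_sentence_pairs(tmp.name, 3, 4))
os.remove(tmp.name)
print(pairs)
```

With such a helper, each evaluation loop could iterate over `read_sentence_pairs('../test_data/MayoSRS.csv', 3, 4)` and apply whichever similarity function is being tested.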


### 3. UMNSRS_similarity Test Dataset 

#### 1. BioSentVec

In [14]:
import csv

file = open('../test_data/UMNSRS_similarity.csv')
csvreader = csv.reader(file)
cosine_similarity_list = []

for row in csvreader:
    sentence_vector1 = model.embed_sentence(preprocess_sentence(row[2]))
    sentence_vector2 = model.embed_sentence(preprocess_sentence(row[3]))
    cosine_sim = 1 - distance.cosine(sentence_vector1[0], sentence_vector2[0])  # 1-D inputs
    cosine_similarity_list.append(cosine_sim)
file.close()
cosine_similarity_score = np.mean(cosine_similarity_list)
print('BioSentVec similarity score:', cosine_similarity_score)

BioSentVec similarity score: 0.23443180937339755


#### 2. Levenshtein Distance

In [15]:
import csv

file = open('../test_data/UMNSRS_similarity.csv')
csvreader = csv.reader(file)

lev_dist_similarity_list = []
for row in csvreader:
    sentence_vector1 = preprocess_sentence(row[2])
    sentence_vector2 = preprocess_sentence(row[3])
    lev_sim = lev_dist(sentence_vector1, sentence_vector2)
    if lev_sim != 0:
        lev_dist_similarity_list.append(lev_sim)
min_distance = np.min(lev_dist_similarity_list)

dist_list = []
for dist in lev_dist_similarity_list:
    dist_list.append(min_distance / dist)
lev_similarity_score = np.mean(dist_list)
file.close()
print('Levenshtein similarity score:', lev_similarity_score)

Levenshtein similarity score: 0.12853671720147514


#### 3. Universal Sentence Encoder

In [37]:
import csv

file = open('../test_data/UMNSRS_similarity.csv')
csvreader = csv.reader(file)
sent_encoder_similarity_list = []

for row in csvreader:
    sentence_vector1 = universal_sentence_encoder([preprocess_sentence(row[2])])
    sentence_vector2 = universal_sentence_encoder([preprocess_sentence(row[3])])
    sim = np.inner(sentence_vector1, sentence_vector2)[0][0]
    sent_encoder_similarity_list.append(sim)
file.close()
similarity_score = np.mean(sent_encoder_similarity_list)
print('Universal Sentence Encoder similarity score:', similarity_score)

Universal Sentence Encoder similarity score: 0.3192265


## References

C. Comito, D. Falcone and A. Forestiero, "AI-Driven Clinical Decision Support: Enhancing Disease Diagnosis Exploiting Patients Similarity," in IEEE Access, vol. 10, pp. 6878-6888, 2022, doi: 10.1109/ACCESS.2022.3142100.

Chen Q, Peng Y, Lu Z. BioSentVec: creating sentence embeddings for biomedical texts. 2018. arXiv:1810.09302.