Calculate cosine similarities between sentences using BERT pre-trained models
Pre-trained BERT models can be used for more than just question/answer tasks; see
https://www.ironmanjohn.com/home/question-and-answer-for-long-passages-using-bert for more details. They can also be used to determine how similar two sentences are to each other.

In this repo I demonstrate how to find these similarities using a measure known as cosine similarity.
I do some very simple testing with three sentences that I have tokenized manually.
If you are using a larger corpus, you will definitely want to tokenize the sentences with something like nltk.tokenize.
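
The gist of the approach is to run each sentence through a pre-trained BERT model, pool the token embeddings into a single sentence vector, and then compare vectors with cosine similarity. BERT_sentence_similarity.py contains the actual code; the sketch below is only an illustration of the idea using the Hugging Face transformers library (the bert-base-uncased model and the mean-pooling step are my assumptions here, not necessarily what the script does):

```python
# Illustrative sketch only -- see BERT_sentence_similarity.py for the repo's implementation.
# Assumes the Hugging Face `transformers` library and PyTorch are installed.
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
model.eval()

def embed(sentence):
    """Return one vector per sentence by mean-pooling BERT's token embeddings."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    # outputs.last_hidden_state has shape (1, num_tokens, hidden_size)
    return outputs.last_hidden_state.mean(dim=1).squeeze(0)

def cosine_similarity(a, b):
    """Cosine similarity between two 1-D tensors."""
    return torch.nn.functional.cosine_similarity(a, b, dim=0).item()
```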

The first two sentences (0 and 1) come from the same blog entry (https://www.ironmanjohn.com/home/question-and-answer-for-long-passages-using-bert), while the third (2) comes from a separate blog entry (https://www.ironmanjohn.com/home/building-an-asl-vision-ai-model-at-supercompute-2019).
Sentences 0 and 1 should therefore be more similar to each other than either is to sentence 2.

The sentences are:

  0. BERT was developed by Google and Nvidia has created an optimized version that uses TensorRT
  1. One drawback of BERT is that only short passages can be queried
  2. I attended a conference in Denver
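
Using the helper sketched above, the expected pattern can be checked directly (again, a hypothetical usage example rather than the repo's actual output):

```python
sentences = [
    "BERT was developed by Google and Nvidia has created an optimized version that uses TensorRT",
    "One drawback of BERT is that only short passages can be queried",
    "I attended a conference in Denver",
]

vectors = [embed(s) for s in sentences]

# Sentences 0 and 1 (same blog post) should score higher with each other
# than either does with sentence 2 (different blog post).
print(cosine_similarity(vectors[0], vectors[1]))
print(cosine_similarity(vectors[0], vectors[2]))
print(cosine_similarity(vectors[1], vectors[2]))
```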

For a more detailed explanation, see my full blog post at ironmanjohn.com and on Medium at medium.com.
