# Module 1: Exploring BM25 similarity and semantic similarity

Before we get started with Amazon OpenSearch and our search web app, let's explore some of the core concepts in search. Below, we'll demonstrate the difference between two ways of matching data: BM25 similarity (keyword matching) and cosine similarity (semantic vector matching).

### 1. Upgrade PyTorch and restart Kernel

Before we begin, we need to upgrade PyTorch and restart the notebook kernel. The upgrade should take 2-3 minutes to complete, after which you should see a message like "Successfully installed torch-1.nn.n".

You may also see a message stating "ERROR: pip's dependency resolver does not..." - you can ignore this error.

In [None]:
!pip install --upgrade torch

Now we need to restart the kernel.

In [None]:
from IPython.display import display_html
def restartkernel() :
    display_html("<script>Jupyter.notebook.kernel.restart()</script>",raw=True)
restartkernel()

Next, let's verify the version of Torch to ensure everything is up to date. The version should be 1.10.2 or higher.

In [None]:
import torch
print(torch.__version__)
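
If you'd rather check the version in code than by eye, a minimal sketch follows. The helper name `meets_minimum` is hypothetical (it is not part of PyTorch), and it assumes the version string starts with a numeric release such as `1.13.1`, optionally followed by a build suffix like `+cu117`.

```python
import re

def meets_minimum(version_string, minimum=(1, 10, 2)):
    """Return True if a version string such as '1.13.1+cu117' is at least `minimum`."""
    # Keep only the leading numeric release, dropping suffixes such as '+cu117'.
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", version_string)
    if match is None:
        raise ValueError(f"Unrecognized version string: {version_string!r}")
    # A missing patch number (e.g. '1.10') is treated as 0.
    parts = tuple(int(p) if p is not None else 0 for p in match.groups())
    return parts >= tuple(minimum)

# In the notebook you would call: meets_minimum(torch.__version__)
print(meets_minimum("1.10.2"))
print(meets_minimum("1.9.0+cu111"))
```

Comparing integer tuples avoids the pitfall of comparing version strings directly, where "1.9" would sort after "1.10".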

### 2. Install Pre-Requisites

Before we can experiment with different searches, we need to install some required libraries.

In [None]:
!pip install -q transformers
!pip install -U sentence-transformers rank_bm25

from rank_bm25 import BM25Okapi
from sklearn.feature_extraction import _stop_words
import string
from tqdm.autonotebook import tqdm
import numpy as np

### 3. Create a sample dataset

Let's now create a very simple dataset: a list of four questions.

In [None]:
passages=["does this work with xbox?",
          "Does the M70 work with Android phones?", 
          "does this work with iphone?",
          "Can this work with an xbox "
         ]

### 4. Explore BM25 similarity

Execute the following to explore BM25 similarity. First, we'll tokenize the dataset, then use BM25 to score the query "does this work with xbox?" against each of our sample questions.

In [None]:
def bm25_tokenizer(text):
    tokenized_doc = []
    for token in text.lower().split():
        token = token.strip(string.punctuation)

        if len(token) > 0 and token not in _stop_words.ENGLISH_STOP_WORDS:
            tokenized_doc.append(token)
    return tokenized_doc


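# Tokenize every passage in the dataset
tokenized_corpus = []
for passage in tqdm(passages):
    tokenized_corpus.append(bm25_tokenizer(passage))

# Build the BM25 index over the tokenized passages
bm25 = BM25Okapi(tokenized_corpus)

# Score every passage against the tokenized query
bm25_scores = bm25.get_scores(bm25_tokenizer("does this work with xbox?"))

# Pair each BM25 score with the index of its passage
all_sentence_combinations = []
for i in range(len(bm25_scores)):
    all_sentence_combinations.append([bm25_scores[i], i])

# Sort list by the highest BM25 score
all_sentence_combinations = sorted(all_sentence_combinations, key=lambda x: x[0], reverse=True)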

print("Top most similar pairs:")
for score, i in all_sentence_combinations[0:4]:
    print("{} \t {} \t {:.4f}".format(passages[i],bm25_tokenizer(passages[i]),bm25_scores[i]))
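
To see what a BM25 score is made of, here is a minimal sketch of a textbook BM25 formula in plain Python. It is an illustration only, not the `rank_bm25` implementation: the function name, the pre-tokenized `docs` list, and the non-negative IDF variant are assumptions made for this example, and the library handles IDF differently in detail, so the numbers will not match the cell above.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, tokenized_docs, k1=1.5, b=0.75):
    """Score every tokenized document against the query with a textbook BM25 formula."""
    n_docs = len(tokenized_docs)
    avg_len = sum(len(doc) for doc in tokenized_docs) / n_docs
    # Document frequency: the number of documents that contain each term.
    doc_freq = Counter(term for doc in tokenized_docs for term in set(doc))

    scores = []
    for doc in tokenized_docs:
        term_freq = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in term_freq:
                continue  # terms absent from the document contribute nothing
            # Non-negative IDF variant (an assumption): rarer terms weigh more.
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            tf = term_freq[term]
            # Term-frequency saturation with document-length normalization.
            norm = tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc) / avg_len))
            score += idf * norm
        scores.append(score)
    return scores

# Hypothetical, already-tokenized documents for illustration.
docs = [["work", "xbox"], ["m70", "work", "android", "phones"], ["work", "iphone"]]
print(bm25_scores(["work", "xbox"], docs))
```

A term that appears in every document ("work") earns a small IDF, while a rare term ("xbox") dominates the score. A document that shares no tokens with the query scores exactly zero, however close its meaning.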
    


### 5. Semantic Similarity


Execute the following to explore semantic similarity. We'll use the same dataset as above, but this time we encode each question as a vector embedding and rank the questions by their cosine similarity to the first question ("does this work with xbox?"). Compare how the matches are ranked with the BM25 results.

In [None]:
from sentence_transformers import SentenceTransformer, util
model = SentenceTransformer('all-MiniLM-L6-v2')

#Encode all sentences
embeddings = model.encode(passages)

#Compute cosine similarity between all pairs
cos_sim = util.cos_sim(embeddings, embeddings)

#Cosine similarity of every passage with the query (the first passage, index 0)
all_sentence_combinations = []
for i in range(len(cos_sim)):
    all_sentence_combinations.append([cos_sim[0][i], i])

#Sort list by the highest cosine similarity score
all_sentence_combinations = sorted(all_sentence_combinations, key=lambda x: x[0], reverse=True)

print("Top most similar pairs:")
for score, i in all_sentence_combinations[0:4]:
    print("{} \t {:.4f}".format(passages[i],cos_sim[0][i]))
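
For reference, the cosine similarity of two vectors is their dot product divided by the product of their lengths. The sketch below computes it in plain Python on small made-up vectors; the function name and the vectors are assumptions for illustration, and in the cell above `util.cos_sim` does the same job for the real embeddings.

```python
import math

def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Vectors pointing in the same direction score close to 1, regardless of length.
print(cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))
# Orthogonal vectors score 0.
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))
```

Because the score depends only on direction, two sentences with few words in common can still score highly if the model maps them to nearby embeddings.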

### 6. Compare the differences.

As you can see, the two rankings differ noticeably, even with a trivial dataset. BM25 scores only the terms that a question shares with the query after tokenization, while cosine similarity over sentence embeddings compares the questions by meaning.

In this module, we've used fairly simple steps with a very small dataset to demonstrate the difference between BM25 and cosine similarity. In the following modules, we'll demonstrate these same concepts using OpenSearch and a larger, more complex dataset.