<a href="https://colab.research.google.com/github/kjahan/semantic_similarity/blob/main/examples/colab/neeva_q2q_similarity_model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Query2Query Similarity

https://neeva.com/blog/state-of-the-art-query2query-similarity

`query2query`

**This is a sentence-transformers model: it maps queries to a 384-dimensional dense vector space and can be used for tasks like clustering or semantic search over queries.**

In [2]:
!pip install -U sentence-transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting sentence-transformers
  Downloading sentence-transformers-2.2.2.tar.gz (85 kB)
Collecting transformers<5.0.0,>=4.6.0
  Downloading transformers-4.22.1-py3-none-any.whl (4.9 MB)
Collecting sentencepiece
  Downloading sentencepiece-0.1.97-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
Collecting huggingface-hub>=0.4.0
  Downloading huggingface_hub-0.9.1-py3-none-any.whl (120 kB)
Collecting tokenizers!=0.11.3,<0.13,>=0.11.1
  Downloading tokenizers-0.12.1-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (6.6 MB)
Building wheels for collected p

## Load q2q model

In [3]:
from sentence_transformers import SentenceTransformer, util


q2q_model = SentenceTransformer('neeva/query2query')


The cache for model files in Transformers v4.22.0 has been updated. Migrating your old cache. This is a one-time only operation. You can interrupt this and resume the migration later on by calling `transformers.utils.move_cache()`.


Moving 0 files to the new cache system



## Test model

In [13]:
queries = ["flight cost from nyc to la", "ticket prices from nyc to la"]

embeddings = q2q_model.encode(queries)
# Compute cosine similarity
cosine_score = util.pytorch_cos_sim(embeddings[0], embeddings[1]).detach().numpy()[0][0]

print(cosine_score)

0.9671601
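
For reference, the cosine score printed above is the dot product of the two embeddings divided by the product of their norms. The following is a minimal pure-Python sketch of that formula. The `cosine_similarity` helper and the toy vectors are illustrative assumptions and are not part of sentence-transformers; in this notebook the actual computation is done by `util.pytorch_cos_sim`.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (hypothetical helper)."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Toy 3-dimensional vectors standing in for the 384-dimensional embeddings
a = [1.0, 2.0, 3.0]
b = [2.0, 4.0, 6.0]   # points in the same direction as a
c = [-3.0, 0.0, 1.0]  # orthogonal to a (dot product is zero)

print(cosine_similarity(a, b))  # close to 1: same direction
print(cosine_similarity(a, c))  # close to 0: unrelated directions
```

Scores near 1 mean the two vectors point the same way, which is why the paraphrased flight queries score so high.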


## Load SBERT model

In [4]:
sbert_model = SentenceTransformer('paraphrase-MiniLM-L12-v2')


## Try SBERT

In [11]:
queries = ["flight cost from nyc to la", "ticket prices from nyc to la"]

embeddings = sbert_model.encode(queries)
# Compute cosine similarity
cosine_score = util.pytorch_cos_sim(embeddings[0], embeddings[1]).detach().numpy()[0][0]

print(cosine_score)

0.85507935


## Test Quora examples

Let's try both models on question pairs from the Quora Question Pairs dataset:

https://www.kaggle.com/competitions/quora-question-pairs/data?select=test.csv.zip

In [14]:
#"1075","2144","2145","Where can I get free books to read or download?","Where can I get free books?","1"
queries = ["Where can I get free books to read or download?","Where can I get free books?"]


embeddings = q2q_model.encode(queries)
# Compute cosine similarity
cosine_score = util.pytorch_cos_sim(embeddings[0], embeddings[1]).detach().numpy()[0][0]

print(cosine_score)

0.770394


In [15]:
queries = ["Where can I get free books to read or download?","Where can I get free books?"]

embeddings = sbert_model.encode(queries)
# Compute cosine similarity
cosine_score = util.pytorch_cos_sim(embeddings[0], embeddings[1]).detach().numpy()[0][0]

print(cosine_score)

0.9250375


In [5]:
def compute_cosine_sim(queries):
  """Return (q2q, sbert) cosine similarities for a pair of queries."""
  # Score the pair with the query2query model
  embeddings = q2q_model.encode(queries)
  q2q_cosine_score = util.pytorch_cos_sim(embeddings[0], embeddings[1]).detach().numpy()[0][0]

  # Score the same pair with the SBERT baseline
  embeddings = sbert_model.encode(queries)
  sbert_cosine_score = util.pytorch_cos_sim(embeddings[0], embeddings[1]).detach().numpy()[0][0]

  return q2q_cosine_score, sbert_cosine_score
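
The helper above depends on the `q2q_model` and `sbert_model` globals. As a rough sketch of an alternative, the comparison could take the encoders as an argument so that any number of models can be scored side by side. Everything here is hypothetical: `compare_encoders`, `cosine`, and the toy `letter_counts` encoder are stand-ins used only so the example runs without downloading a model.

```python
import math


def cosine(u, v):
    """Dot product divided by the product of the norms."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def compare_encoders(queries, encoders):
    """Score one query pair with every encoder in `encoders`.

    `encoders` maps a display name to a function that takes the list of
    two queries and returns two embedding vectors.
    """
    scores = {}
    for name, encode in encoders.items():
        first, second = encode(queries)
        scores[name] = cosine(first, second)
    return scores


def letter_counts(queries):
    """Toy encoder (assumption): one count per letter a-z, ignoring everything else."""
    alphabet = "abcdefghijklmnopqrstuvwxyz"
    return [[q.lower().count(ch) for ch in alphabet] for q in queries]


scores = compare_encoders(["what is 2+2?", "what is 2+3?"], {"letters": letter_counts})
for name, score in scores.items():
    print("{} cosine: {}".format(name, score))
```

With the models loaded in this notebook, one would presumably pass something like `{"q2q": q2q_model.encode, "sbert": sbert_model.encode}` instead of the toy encoder; that usage is an assumption and was not run here.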

In [23]:
queries = ["what is 2+2?","what is 2+3?"]

q2q_cosine_score, sbert_cosine_score = compute_cosine_sim(queries)
print("q2q cosine: {}".format(q2q_cosine_score))
print("sbert cosine: {}".format(sbert_cosine_score))

q2q cosine: 0.6891094446182251
sbert cosine: 0.921252965927124


In [24]:
queries = ["what is the united states population?", "how many people live in united states?"]

q2q_cosine_score, sbert_cosine_score = compute_cosine_sim(queries)
print("q2q cosine: {}".format(q2q_cosine_score))
print("sbert cosine: {}".format(sbert_cosine_score))

q2q cosine: 0.9315909147262573
sbert cosine: 0.88234543800354


In [25]:
queries = ["How can I reduce my belly fat through a diet?","How can I reduce my lower belly fat in one month?"]

q2q_cosine_score, sbert_cosine_score = compute_cosine_sim(queries)
print("q2q cosine: {}".format(q2q_cosine_score))
print("sbert cosine: {}".format(sbert_cosine_score))

q2q cosine: 0.519539475440979
sbert cosine: 0.8516068458557129


In [6]:
queries = ["a structure that contains DNA, the genetic material that is passed from one generation to the next", "what is chromosome?"]

q2q_cosine_score, sbert_cosine_score = compute_cosine_sim(queries)
print("q2q cosine: {}".format(q2q_cosine_score))
print("sbert cosine: {}".format(sbert_cosine_score))


q2q cosine: 0.2742469012737274
sbert cosine: 0.5903246402740479


In [7]:
queries = ["which one of the following statements does not represent an advantage of using viral marketing campaigns?", 
           "which of the following statements is true of viral marketing?"]

q2q_cosine_score, sbert_cosine_score = compute_cosine_sim(queries)
print("q2q cosine: {}".format(q2q_cosine_score))
print("sbert cosine: {}".format(sbert_cosine_score))

q2q cosine: 0.7035866975784302
sbert cosine: 0.8741922378540039


In [8]:
queries = ["What kind of food do rich people like Bill Gates, Warren Buffet, etc. generally classic?", 
           "What be rich first to practice altruism? Like Bill Gates and Warren Buffet"]

q2q_cosine_score, sbert_cosine_score = compute_cosine_sim(queries)
print("q2q cosine: {}".format(q2q_cosine_score))
print("sbert cosine: {}".format(sbert_cosine_score))

q2q cosine: 0.6161669492721558
sbert cosine: 0.5818869471549988


In [9]:
queries = ["Are we humans or slaves of our tell desires?","To whom are other we the human beings slaves?"]

q2q_cosine_score, sbert_cosine_score = compute_cosine_sim(queries)
print("q2q cosine: {}".format(q2q_cosine_score))
print("sbert cosine: {}".format(sbert_cosine_score))

q2q cosine: 0.49185067415237427
sbert cosine: 0.7112250924110413


In [10]:
queries = ["If I step 240 volts AC to 120 volts AC, and rectify it to DC. What will be the voltage and amperage the DC output?", 
           "I am working in an IT company with 9 hours side work. Is it possible for me to crack the GATE in electrical engineering?"]

q2q_cosine_score, sbert_cosine_score = compute_cosine_sim(queries)
print("q2q cosine: {}".format(q2q_cosine_score))
print("sbert cosine: {}".format(sbert_cosine_score))

q2q cosine: 0.13593006134033203
sbert cosine: 0.21269062161445618


In [11]:
queries = ["Why make I get scared so easily?", "Why do get scared?"]

q2q_cosine_score, sbert_cosine_score = compute_cosine_sim(queries)
print("q2q cosine: {}".format(q2q_cosine_score))
print("sbert cosine: {}".format(sbert_cosine_score))

q2q cosine: 0.7263717651367188
sbert cosine: 0.8386101126670837


In [12]:
queries = ["What is the difference between CA and MBA?","What is the difference between: CA & CFA?"]

q2q_cosine_score, sbert_cosine_score = compute_cosine_sim(queries)
print("q2q cosine: {}".format(q2q_cosine_score))
print("sbert cosine: {}".format(sbert_cosine_score))

q2q cosine: 0.5242831110954285
sbert cosine: 0.6357532739639282


In [13]:
queries = ["How can I object masturbating?", "How simple I stop fapping forever?"]

q2q_cosine_score, sbert_cosine_score = compute_cosine_sim(queries)
print("q2q cosine: {}".format(q2q_cosine_score))
print("sbert cosine: {}".format(sbert_cosine_score))

q2q cosine: 0.22586457431316376
sbert cosine: 0.31829962134361267


In [14]:
queries = ["how to convert a list of dictionaries to dataframe in python", "Convert list of dictionaries to a pandas DataFrame"]

q2q_cosine_score, sbert_cosine_score = compute_cosine_sim(queries)
print("q2q cosine: {}".format(q2q_cosine_score))
print("sbert cosine: {}".format(sbert_cosine_score))

q2q cosine: 0.9721435308456421
sbert cosine: 0.7675014138221741


In [15]:
queries = ["how to convert a list of dictionaries to dataframe in python", "How to convert a list of dictionaries to a Pandas DataFrame"]

q2q_cosine_score, sbert_cosine_score = compute_cosine_sim(queries)
print("q2q cosine: {}".format(q2q_cosine_score))
print("sbert cosine: {}".format(sbert_cosine_score))

q2q cosine: 0.9729502201080322
sbert cosine: 0.7719252109527588


In [16]:
queries = ["how to convert a list of dictionaries to dataframe in python", "Trying to convert list of dictionaries to a pandas dataframe"]


q2q_cosine_score, sbert_cosine_score = compute_cosine_sim(queries)
print("q2q cosine: {}".format(q2q_cosine_score))
print("sbert cosine: {}".format(sbert_cosine_score))

q2q cosine: 0.9636967778205872
sbert cosine: 0.7061135768890381


In [17]:
queries = ["how to convert a list of dictionaries to dataframe in python", "Converting a list of lists of dictionaries to a Pandas DataFrame"]

q2q_cosine_score, sbert_cosine_score = compute_cosine_sim(queries)
print("q2q cosine: {}".format(q2q_cosine_score))
print("sbert cosine: {}".format(sbert_cosine_score))

q2q cosine: 0.8564285635948181
sbert cosine: 0.7714899778366089
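
To summarize the four pandas-related comparisons above, the sketch below averages each model's score. The numbers are copied from the outputs of those four cells and rounded to four decimals; the `results` structure and variable names are illustrative assumptions.

```python
from statistics import mean

# (second query, q2q score, sbert score), copied from the cell outputs above
results = [
    ("Convert list of dictionaries to a pandas DataFrame", 0.9721, 0.7675),
    ("How to convert a list of dictionaries to a Pandas DataFrame", 0.9730, 0.7719),
    ("Trying to convert list of dictionaries to a pandas dataframe", 0.9637, 0.7061),
    ("Converting a list of lists of dictionaries to a Pandas DataFrame", 0.8564, 0.7715),
]

q2q_mean = mean(q2q for _, q2q, _ in results)
sbert_mean = mean(sbert for _, _, sbert in results)
# Number of pairs on which the q2q model gave the higher score
q2q_wins = sum(1 for _, q2q, sbert in results if q2q > sbert)

print("mean q2q cosine: {:.4f}".format(q2q_mean))
print("mean sbert cosine: {:.4f}".format(sbert_mean))
print("q2q scored higher on {} of {} pairs".format(q2q_wins, len(results)))
```

Four pairs are far too few to draw a general conclusion. The summary only makes the pattern in these particular cells easier to read.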
