# Semantic Search
You may already be familiar with keyword-based search (the Boolean model), which retrieves the results that match a given keyword or pattern. Regular expressions extend this idea with more advanced patterns, such as lexico-syntactic patterns. These traditional approaches cannot handle synonymy (for example, car means the same as automobile) or word-sense ambiguity (for example, bank as the side of a river versus bank as a financial institution). Synonymy lowers recall because relevant documents that use a different word are missed, while ambiguity lowers precision because irrelevant documents that happen to contain the word are retrieved. Vector-based, or semantic, search approaches can mitigate these drawbacks by building dense numerical representations of both queries and documents.
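
To make the recall and precision problems concrete, here is a minimal sketch of a naive keyword match over a toy corpus. The documents and the `keyword_search` helper are invented for illustration and are not part of the case study.

```python
def keyword_search(keyword, documents):
    """Return the documents that contain the keyword as a whole word (case-insensitive)."""
    keyword = keyword.lower()
    return [doc for doc in documents if keyword in doc.lower().split()]


# Toy corpus (hypothetical): two documents about vehicles, one about a river
documents = [
    "my car needs a new battery",
    "this automobile has a flat battery",
    "the river bank was flooded",
]

# The query "car" misses the automobile document, which hurts recall
hits = keyword_search("car", documents)
print(hits)

# The query "bank" returns the river document even if the user meant a financial bank,
# which hurts precision
print(keyword_search("bank", documents))
```

Neither failure can be repaired by a better pattern alone, because the match is made on surface strings rather than on meaning.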

## Case Study: Transform idle FAQ to Question Answering Model

Let's set up a case study for Frequently Asked Questions (FAQs) that sit idle on websites. We will treat the FAQ as the resource to be searched in a semantic search problem. We will be using the FAQ from the World Wide Fund for Nature ([WWF](https://www.wwf.org.uk/)), a non-governmental nature organization.

Performing a semantic search with a semantic model is very similar to a one-shot learning problem: we have a single example of the class (the query) and want to reorder the rest of the data (the sentences) according to it. You can restate the problem as searching for samples that are semantically close to the given sample, or as a binary classification relative to that sample. The model provides a similarity metric, and all the other samples are ranked by this metric. The resulting ordered list is the search result.
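
The reordering idea can be sketched in plain Python before a real model is involved. The three-dimensional vectors below are invented stand-ins for sentence embeddings, and `cosine_similarity` and `rank_by_similarity` are illustrative helper names rather than library functions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors given as sequences of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_by_similarity(query_vec, candidate_vecs):
    """Return candidate indices ordered from most to least similar to the query."""
    scores = [cosine_similarity(query_vec, c) for c in candidate_vecs]
    return sorted(range(len(candidate_vecs)), key=lambda i: scores[i], reverse=True)


# Made-up "embeddings" for one query and three candidate sentences
query = [1.0, 0.0, 1.0]
candidates = [
    [0.0, 1.0, 0.0],  # orthogonal to the query
    [0.9, 0.1, 1.1],  # nearly parallel to the query
    [1.0, 1.0, 0.0],  # partially aligned with the query
]

ranking = rank_by_similarity(query, candidates)
print(ranking)
```

With a sentence encoder, the only change is that the vectors come from the model instead of being written by hand.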

In [None]:
# !pip install sentence-transformers

In [1]:
import pandas as pd
import sklearn
import numpy as np


In [2]:
wwf_faq = ["I haven’t received my adoption pack. What should I do?",
           "How quickly will I receive my adoption pack?",
           "How can I renew my adoption?",
           "How do I change my address or other contact details?",
           "Can I adopt an animal if I don’t live in the UK?",
           "If I adopt an animal, will I be the only person who adopts that animal?",
           "My pack doesn't contain a certificate",
           "My adoption is a gift but won’t arrive on time. What can I do?",
           "Can I pay for an adoption with a one-off payment?",
           "Can I change the delivery address for my adoption pack after I’ve placed my order?",
           "How long will my adoption last for?",
           "How often will I receive updates about my adopted animal?",
           "What animals do you have for adoption?",
           "How can I find out more information about my adopted animal?",
           "How is my adoption money spent?",
           "What is your refund policy?",
           "An error has been made with my Direct Debit payment, can I receive a refund?",
           "How do I change how you contact me?"]

In [3]:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("quora-distilbert-base")


In [4]:
faq_embeddings = model.encode(wwf_faq)

In [5]:
test_questions = ["What should be done, if the adoption pack did not reach to me?",
                  " How fast is my adoption pack delivered to me?",
                  "What should I do to renew my adoption?",
                  "What should be done to change adress and contact details ?",
                  "I live outside of the UK, Can I still adopt an animal?"]
test_q_emb = model.encode(test_questions)

In [6]:
from scipy.spatial.distance import cdist

for q, qe in zip(test_questions, test_q_emb):
    # Cosine distance from this query to every FAQ embedding (lower means more similar)
    distances = cdist([qe], faq_embeddings, "cosine")[0]
    # Indices of the three closest FAQ entries
    ind = np.argsort(distances)[:3]
    print("\n Test Question: \n " + q)
    for i in ind:
        print(distances[i], i, wwf_faq[i], sep="\t")


 Test Question: 
 What should be done, if the adoption pack did not reach to me?
0.1494579210985294	0	I haven’t received my adoption pack. What should I do?
0.24940211500660137	7	My adoption is a gift but won’t arrive on time. What can I do?
0.36697596844102975	1	How quickly will I receive my adoption pack?

 Test Question: 
  How fast is my adoption pack delivered to me?
0.16582392910695054	1	How quickly will I receive my adoption pack?
0.3470479399195845	0	I haven’t received my adoption pack. What should I do?
0.35111153234452375	7	My adoption is a gift but won’t arrive on time. What can I do?

 Test Question: 
 What should I do to renew my adoption?
0.04168249099787469	2	How can I renew my adoption?
0.2993019127159663	12	What animals do you have for adoption?
0.30140717858980404	0	I haven’t received my adoption pack. What should I do?

 Test Question: 
 What should be done to change adress and contact details ?
0.2766019694529397	3	How do I change my address or other contact detail
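
In each result row, the first column is the cosine distance and the second is the index of the matched entry in `wwf_faq`. For a query embedding q and an FAQ embedding f (symbols introduced here for illustration), the `"cosine"` metric of `cdist` is one minus the cosine similarity, so smaller values mean closer matches:

```latex
d_{\cos}(q, f) = 1 - \frac{q \cdot f}{\lVert q \rVert_2 \, \lVert f \rVert_2}
```

For instance, the distance of roughly 0.149 reported for the first test question corresponds to a cosine similarity of about 0.851.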

For deployment, we can define the following get_best() function, which takes a question and prints the K most similar questions in the FAQ:

In [7]:
def get_best(query, K=3):
    # Embed the query and compute its cosine distance to every FAQ entry
    query_embedding = model.encode([query])
    distances = cdist(query_embedding, faq_embeddings, "cosine")[0]
    # Keep the indices of the K smallest distances
    ind = np.argsort(distances)[:K]
    print("\n" + query)
    for i in ind:
        print(distances[i], wwf_faq[i], sep="\t")

In [8]:
get_best("How do I change my contact info?",3)


How do I change my contact info?
0.056767933031530604	How do I change my address or other contact details?
0.18566549461599935	How do I change how you contact me?
0.3240832894105864	How can I renew my adoption?


What if the input question is not similar to any question in the FAQ? Here is such a question:

In [9]:
get_best("How do I get my plane ticket if I bought it online?")


How do I get my plane ticket if I bought it online?
0.35947507166215176	How do I change how you contact me?
0.36807847141978456	How do I change my address or other contact details?
0.4306633545016928	My adoption is a gift but won’t arrive on time. What can I do?


The best (lowest) dissimilarity score is about 0.36. We therefore need to define a threshold, such as 0.3, so that the model ignores matches whose distance exceeds the threshold and reports that no similar question was found.
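
One way to apply such a threshold is sketched below. `filter_matches` is a hypothetical helper, not part of the notebook. It takes precomputed distances and the matching texts, so it can be tried without loading the model. The default of 0.3 simply mirrors the value suggested above and would need tuning on real queries.

```python
def filter_matches(distances, texts, k=3, threshold=0.3):
    """Return up to k (distance, text) pairs whose distance is within the threshold.

    An empty list means that no sufficiently similar FAQ entry was found.
    """
    ranked = sorted(zip(distances, texts), key=lambda pair: pair[0])
    return [(d, t) for d, t in ranked[:k] if d <= threshold]


# Distances copied (rounded) from the two get_best() calls shown earlier
contact_query = filter_matches(
    [0.0568, 0.1857, 0.3241],
    ["How do I change my address or other contact details?",
     "How do I change how you contact me?",
     "How can I renew my adoption?"],
)
plane_query = filter_matches(
    [0.3595, 0.3681, 0.4307],
    ["How do I change how you contact me?",
     "How do I change my address or other contact details?",
     "My adoption is a gift but won’t arrive on time. What can I do?"],
)

print(len(contact_query))  # number of matches kept for the contact question
print(plane_query or "No similar question found")
```

The same filter could be applied inside `get_best()` by passing it the `distances` array and the `wwf_faq` list.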

# Further Reading
Please refer to the following works/papers for more information about the topics that were covered in this chapter:

- Lewis, M., Liu, Y., Goyal, N., Ghazvininejad, M., Mohamed, A., Levy, O., ... & Zettlemoyer, L. (2019). BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv preprint arXiv:1910.13461.
- Pushp, P. K., & Srivastava, M. M. (2017). Train once, test anywhere: Zero-shot learning for text classification. arXiv preprint arXiv:1712.05972.
- Reimers, N., & Gurevych, I. (2019). Sentence-BERT: Sentence embeddings using Siamese BERT-networks. arXiv preprint arXiv:1908.10084.
- Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., ... & Stoyanov, V. (2019). RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.
- Williams, A., Nangia, N., & Bowman, S. R. (2017). A broad-coverage challenge corpus for sentence understanding through inference. arXiv preprint arXiv:1704.05426.
- Cer, D., Yang, Y., Kong, S. Y., Hua, N., Limtiaco, N., John, R. S., ... & Kurzweil, R. (2018). Universal sentence encoder. arXiv preprint arXiv:1803.11175.
- Yang, Y., Cer, D., Ahmad, A., Guo, M., Law, J., Constant, N., ... & Kurzweil, R. (2019). Multilingual universal sentence encoder for semantic retrieval. arXiv preprint arXiv:1907.04307.
- Humeau, S., Shuster, K., Lachaux, M. A., & Weston, J. (2019). Poly-encoders: Transformer architectures and pre-training strategies for fast and accurate multi-sentence scoring. arXiv preprint arXiv:1905.01969.