# Semantic Search using SBERT on the Quora Questions Dataset

This script shows how to perform semantic search with PyTorch. It performs exact nearest-neighbor search.

As the dataset, we use the Quora Duplicate Questions dataset, which contains about 500k questions (we use only about 100k of them):
https://www.quora.com/q/quoradata/First-Quora-Dataset-Release-Question-Pairs


As the embedding model, we use the SBERT model 'quora-distilbert-multilingual',
which is aligned for 100 languages. That is, you can type in a question in various languages and it will
return the closest questions in the corpus (questions in the corpus are mainly in English).
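
Exact nearest-neighbor search means that the query embedding is compared with every embedding in the corpus, with no approximate index in between. The sketch below illustrates that idea in plain Python, using cosine similarity as the score. It is not how `sentence-transformers` implements the search (the library operates on tensors). The names `cosine_similarity` and `exact_search` and the 3-dimensional toy vectors are made up for illustration.

```python
import heapq
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def exact_search(query, corpus, top_k=5):
    """Score the query against every corpus vector and keep the top_k best."""
    scored = (
        {"corpus_id": idx, "score": cosine_similarity(query, vec)}
        for idx, vec in enumerate(corpus)
    )
    return heapq.nlargest(top_k, scored, key=lambda hit: hit["score"])


# Toy 3-dimensional "embeddings", invented for this example.
toy_corpus = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
toy_hits = exact_search([1.0, 0.0, 0.0], toy_corpus, top_k=2)
print([hit["corpus_id"] for hit in toy_hits])
```

Because every corpus vector is scored, the cost of one query grows linearly with the corpus size. That is acceptable for the roughly 100k sentences used here, especially on a GPU.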


In [None]:
!pip install sentence-transformers

Collecting sentence-transformers
  Downloading https://files.pythonhosted.org/packages/6a/e2/84d6acfcee2d83164149778a33b6bdd1a74e1bcb59b2b2cd1b861359b339/sentence-transformers-0.4.1.2.tar.gz (64kB)
     |████████████████████████████████| 71kB 6.1MB/s 
Collecting transformers<5.0.0,>=3.1.0
  Downloading https://files.pythonhosted.org/packages/98/87/ef312eef26f5cecd8b17ae9654cdd8d1fae1eb6dbd87257d6d73c128a4d0/transformers-4.3.2-py3-none-any.whl (1.8MB)
     |████████████████████████████████| 1.8MB 11.4MB/s 
Collecting sentencepiece
  Downloading https://files.pythonhosted.org/packages/14/67/e42bd1181472c95c8cda79305df848264f2a7f62740995a46945d9797b67/sentencepiece-0.1.95-cp36-cp36m-manylinux2014_x86_64.whl (1.2MB)
     |████████████████████████████████| 1.2MB 50.9MB/s 
Collecting tokenizers<0.11,>=0.10.1
  Downloading https://files.pythonhosted.org/packages/fd/5b/44baae602e0a30bcc53fbdbc60bd940c15e143d252d658dfdefce736ece5/tokenizers-0.10.1-cp36-cp3

In [None]:
from sentence_transformers import SentenceTransformer, util
import os
import csv
import time
import torch

if not torch.cuda.is_available():
    print("Warning: No GPU detected. Processing will be slow. Please add a GPU to this notebook")

model_name = 'quora-distilbert-multilingual'
model = SentenceTransformer(model_name)

url = "http://qim.fs.quoracdn.net/quora_duplicate_questions.tsv"
dataset_path = "quora_duplicate_questions.tsv"
max_corpus_size = 100000


# Download the dataset if it is not already present locally
if not os.path.exists(dataset_path):
    print("Download dataset")
    util.http_get(url, dataset_path)

# Get all unique sentences from the file
corpus_sentences = set()
with open(dataset_path, encoding='utf8') as fIn:
    reader = csv.DictReader(fIn, delimiter='\t', quoting=csv.QUOTE_MINIMAL)
    for row in reader:
        corpus_sentences.add(row['question1'])
        if len(corpus_sentences) >= max_corpus_size:
            break

        corpus_sentences.add(row['question2'])
        if len(corpus_sentences) >= max_corpus_size:
            break

corpus_sentences = list(corpus_sentences)
print("Encode the corpus. This might take a while")
corpus_embeddings = model.encode(corpus_sentences, show_progress_bar=True, convert_to_tensor=True)



###############################
print("Corpus loaded with {} sentences / embeddings".format(len(corpus_sentences)))



100%|██████████| 501M/501M [00:27<00:00, 18.0MB/s]


Download dataset


100%|██████████| 58.2M/58.2M [00:00<00:00, 95.0MB/s]


Encode the corpus. This might take a while


Corpus loaded with 100000 sentences / embeddings
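
The loading loop above collects questions in a `set`. Python randomizes string hashing per process by default, so the order of `corpus_sentences`, and therefore the meaning of each `corpus_id`, can differ between runs. The sketch below shows one possible order-preserving variant of the same deduplication. It is an alternative under stated assumptions, not the notebook's method: the helper `load_unique_questions` is hypothetical, and the in-memory TSV rows are invented stand-ins for the real file.

```python
import csv
import io


def load_unique_questions(tsv_file, max_size):
    """Collect up to max_size distinct questions, in first-seen order."""
    seen = {}  # dict keys keep insertion order, unlike a set
    reader = csv.DictReader(tsv_file, delimiter="\t", quoting=csv.QUOTE_MINIMAL)
    for row in reader:
        for column in ("question1", "question2"):
            seen.setdefault(row[column], None)
            if len(seen) >= max_size:
                return list(seen)
    return list(seen)


# Tiny in-memory stand-in for the TSV file; the rows are invented.
sample_tsv = io.StringIO(
    "id\tquestion1\tquestion2\n"
    "0\tWhat is Python?\tWhat is the Python language?\n"
    "1\tWhat is Python?\tHow do I install Python?\n"
    "2\tWhat is pip?\tHow do I use pip?\n"
)
sample_questions = load_unique_questions(sample_tsv, max_size=4)
print(sample_questions)
```

With a stable order, a `corpus_id` from one run refers to the same sentence in the next run, which helps if the embeddings are cached to disk.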


In [None]:
# Function that searches the corpus and prints the results
def search(inp_question):
    start_time = time.time()
    question_embedding = model.encode(inp_question, convert_to_tensor=True)
    hits = util.semantic_search(question_embedding, corpus_embeddings)
    end_time = time.time()
    hits = hits[0]  # Get the hits for the first (and only) query

    print("Input question:", inp_question)
    print("Results (after {:.3f} seconds):".format(end_time-start_time))
    for hit in hits[0:5]:
        print("\t{:.3f}\t{}".format(hit['score'], corpus_sentences[hit['corpus_id']]))



In [None]:
search("How can I learn Python online?")

Input question: How can I learn Python online?
Results (after 0.332 seconds):
	0.987	What is the best online resources to learn Python?
	0.980	Where I should learn Python?
	0.980	What's the best way to learn Python?
	0.980	How do I learn Python in easy way?
	0.980	What can I do if I learn Python?
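
The rows above come from the structure that `search` reads: one list of hits per query (hence `hits[0]`), where each hit is a dictionary with a `corpus_id` and a `score`. The sketch below reproduces only the formatting step on hand-written data, so it runs without the model. The helper `format_hits`, the demo corpus, and the demo scores are all invented for illustration.

```python
def format_hits(hits_per_query, corpus, top_n=5):
    """Render the hits of the first query as tab-separated score/sentence rows."""
    first_query_hits = hits_per_query[0]
    return [
        "{:.3f}\t{}".format(hit["score"], corpus[hit["corpus_id"]])
        for hit in first_query_hits[:top_n]
    ]


# Invented corpus and scores, shaped like the hits that search() iterates over.
demo_corpus = ["Where should I learn Python?", "How do I bake bread?", "What is pip?"]
demo_hits = [[
    {"corpus_id": 0, "score": 0.98},
    {"corpus_id": 2, "score": 0.61},
    {"corpus_id": 1, "score": 0.12},
]]
demo_rows = format_hits(demo_hits, demo_corpus, top_n=2)
print("\n".join(demo_rows))
```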


In [None]:
# German: How can I learn Python online?
search("Wie kann ich Python online lernen?")

Input question: Wie kann ich Python online lernen?
Results (after 0.233 seconds):
	0.989	What is the best online resources to learn Python?
	0.980	Where I should learn Python?
	0.980	Where should I start at to learn about how to do Python?
	0.980	Which are the best resources to learn Python?
	0.979	What's the best way to learn Python?


In [None]:
# Chinese: How can I learn Python online?
search("如何在线学习Python")

Input question: 如何在线学习Python
Results (after 0.239 seconds):
	0.983	What is the best online resources to learn Python?
	0.982	Where can I learn web scraping using Python?
	0.982	Where should I start at to learn about how to do Python?
	0.981	What are some good resources to learn web scraping with Python?
	0.980	How do I learn Python systematically?
