# Semantic Search

This notebook demonstrates the `SemanticSearch` class, which uses pretrained BERT-based `SentenceTransformers` models tuned specifically for semantic search.

The process to perform a semantic search is as follows:

1. Encode the corpus of text into a vector space.
2. Encode the query text into the same vector space.
3. Find the sentences in the corpus that are most similar to the query text, ranked by cosine similarity score.
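
The ranking in step 3 can be sketched with the standard library alone. This is a minimal illustration, not the code inside `SemanticSearch`: the helper names `cosine_similarity` and `rank_by_similarity` are hypothetical, and the 3-dimensional vectors are made-up stand-ins for real sentence embeddings, which have hundreds of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_vec, corpus_vecs, top_k=3):
    """Return (corpus index, score) pairs for the top_k most similar vectors."""
    scores = [(i, cosine_similarity(query_vec, vec))
              for i, vec in enumerate(corpus_vecs)]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:top_k]


# Toy "embeddings", invented for illustration only.
corpus_vecs = [
    [1.0, 0.0, 0.0],
    [0.8, 0.6, 0.0],
    [0.0, 0.0, 1.0],
]
query_vec = [1.0, 0.1, 0.0]

ranking = rank_by_similarity(query_vec, corpus_vecs, top_k=2)
for idx, score in ranking:
    print(f"corpus[{idx}] --> similarity {score:.2f}")
```

Because cosine similarity depends only on the angle between vectors, a longer or shorter embedding pointing in the same direction receives the same score.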


## Installation & Setup

Install the required packages.

In [1]:
#!pip install -U sentence-transformers
#!pip install torch

## Configuration

All parameters required to configure the model can be found and adjusted in `config.json`.

In [2]:
import json

# Load the configuration; the context manager closes the file handle.
with open('../semantic_search/config.json', 'r') as f:
    config = json.load(f)

for key, value in config.items():
    print(f"{key}: '{value}'")

MODEL_NAME: 'paraphrase-distilroberta-base-v1'
MODEL_PATH: '../model'
MODEL_DOWNLOAD: 'True'
CORPUS_TEXT_PATH: '../data/corpus_text.pkl'
CORPUS_ENCODED_PATH: '../data/corpus_encoded.pkl'
TRAIN: '{'LOGGER_FORMAT': '%(asctime)s - %(message)s', 'LOGGER_DATE_FORMAT': '%Y-%m-%d %H:%M:%S', 'DATASET_PATH': 'quora-IR-dataset', 'DATASET_CLASSIFICATION_PATH': 'classification/train_pairs.tsv', 'DATASET_MINING_CORPUS_PATH': 'duplicate-mining/dev_corpus.tsv', 'DATASET_MINING_DUPLICATES_PATH': 'duplicate-mining/dev_duplicates.tsv', 'DATASET_INFORMATION_RETRIEVAL_QUERIES_PATH': 'information-retrieval/dev-queries.tsv', 'DATASET_INFORMATION_RETRIEVAL_CORPUS_PATH': 'information-retrieval/corpus.tsv', 'ZIP_SAVE_PATH': 'quora-IR-dataset.zip', 'DATASET_DOWNLOAD_URL': 'https://sbert.net/datasets/quora-IR-dataset.zip', 'BASE_MODEL_NAME': 'stsb-distilbert-base', 'MODEL_SAVE_PATH': 'output/training_multi-task-learning', 'NUM_EPOCHS': 10, 'WARMUP_STEPS': 1000, 'TRAIN_BATCH_SIZE': 64, 'DISTANCE_METRIC': 'COSINE_

## SemanticSearch Class

Import the `SemanticSearch` class and create an instance.

In [3]:
from model import SemanticSearch

ss = SemanticSearch()

## Encode Corpus

All sentences in the corpus are encoded into a vector space, and both the original text and the encoded vectors are stored in local `.pkl` files for later use.

In [4]:
corpus_text = ['A man is eating food.',
          'A man is eating a piece of bread.',
          'The girl is carrying a baby.',
          'A man is riding a horse.',
          'A woman is playing violin.',
          'Two men pushed carts through the woods.',
          'A man is riding a white horse on an enclosed ground.',
          'A monkey is playing drums.',
          'A cheetah is running behind its prey.'
          ]

ss.update_corpus(corpus_text)

print(f"Corpus original text is stored at '{ss.config['CORPUS_TEXT_PATH']}'")
print(f"Corpus encoded text is stored at '{ss.config['CORPUS_ENCODED_PATH']}'")

Corpus original text is stored at './corpus_text.pkl'
Corpus encoded text is stored at './corpus_encoded.pkl'
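
How `update_corpus` serializes the data is not shown in this notebook. As a rough sketch of what storing and reloading a corpus in `.pkl` files could look like, assuming the standard `pickle` module is used, the example below writes to a temporary directory. The helpers `save_pickle` and `load_pickle` are hypothetical, and the vectors are placeholders rather than real embeddings.

```python
import os
import pickle
import tempfile


def save_pickle(obj, path):
    """Serialize obj to path in binary pickle format."""
    with open(path, "wb") as f:
        pickle.dump(obj, f)


def load_pickle(path):
    """Load a previously pickled object from path."""
    with open(path, "rb") as f:
        return pickle.load(f)


corpus_text = ["A man is eating food.", "A monkey is playing drums."]
# Placeholder vectors standing in for real embeddings (invented values).
corpus_encoded = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]

# Write to a throwaway directory so no project files are touched.
with tempfile.TemporaryDirectory() as tmp_dir:
    text_path = os.path.join(tmp_dir, "corpus_text.pkl")
    encoded_path = os.path.join(tmp_dir, "corpus_encoded.pkl")
    save_pickle(corpus_text, text_path)
    save_pickle(corpus_encoded, encoded_path)
    restored_text = load_pickle(text_path)
    restored_encoded = load_pickle(encoded_path)

print(restored_text == corpus_text, restored_encoded == corpus_encoded)
```

Keeping the text and the vectors in the same order is what lets a match at index `i` in the encoded corpus be mapped back to its original sentence.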


## Perform Queries

Given one or more sentences from the customer, find the most similar sentences in the corpus.

In [5]:
sentences = ['A man is eating pasta.', 
             'Someone in a gorilla costume is playing a set of drums.', 
             'A cheetah chases prey on across a field.']

for sentence in sentences:
    hits = ss.search(sentence)
    print('\nQuery: ', sentence)
    print('Hits:')
    # hits maps each matching corpus sentence to its similarity score
    for i, (text, score) in enumerate(hits.items(), start=1):
        print(f'[#{i}] Sentence: {text} --> Similarity score: {score}')



Query:  A man is eating pasta.
Hits:
[#1] Sentence: A man is eating food. --> Similarity score: 0.71
[#2] Sentence: A man is eating a piece of bread. --> Similarity score: 0.61
[#3] Sentence: A man is riding a horse. --> Similarity score: 0.34

Query:  Someone in a gorilla costume is playing a set of drums.
Hits:
[#1] Sentence: A monkey is playing drums. --> Similarity score: 0.68
[#2] Sentence: A woman is playing violin. --> Similarity score: 0.38
[#3] Sentence: A man is riding a horse. --> Similarity score: 0.31

Query:  A cheetah chases prey on across a field.
Hits:
[#1] Sentence: A cheetah is running behind its prey. --> Similarity score: 0.78
[#2] Sentence: A monkey is playing drums. --> Similarity score: 0.28
[#3] Sentence: A man is riding a white horse on an enclosed ground. --> Similarity score: 0.22
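
The third hit for each query scores well below the first, so it can be useful to drop weak matches. The sketch below assumes `hits` behaves like a plain dictionary mapping sentence to score, as the loop above suggests. The helper `filter_hits` and the `0.5` cut-off are illustrative choices, not part of `SemanticSearch`.

```python
def filter_hits(hits, min_score=0.5):
    """Keep only hits whose similarity score is at least min_score."""
    return {sentence: score
            for sentence, score in hits.items()
            if score >= min_score}


# Scores copied from the first query's output above.
hits = {
    "A man is eating food.": 0.71,
    "A man is eating a piece of bread.": 0.61,
    "A man is riding a horse.": 0.34,
}

strong_hits = filter_hits(hits, min_score=0.5)
for sentence, score in strong_hits.items():
    print(f"{sentence} --> {score}")
```

A suitable threshold depends on the model and the data, so it is worth calibrating against a few known-good and known-bad matches.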


## Finetuning

The model can be fine-tuned by retraining it with different datasets and training parameters. The training parameters are listed under `config["TRAIN"]`. To retrain the model, adjust the training parameters accordingly and run the following script:

In [2]:
!python train.py
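
Training parameters can be edited by hand in `config.json`, or programmatically. The sketch below shows one way to do the latter. The helper `update_train_params` is hypothetical, and it assumes `train.py` reads its settings from the `TRAIN` entry of the same JSON file. It runs against a throwaway copy containing only three of the `TRAIN` keys printed earlier, so the real configuration is untouched.

```python
import json
import os
import tempfile


def update_train_params(config_path, **overrides):
    """Load a JSON config, override entries under "TRAIN", and write it back."""
    with open(config_path, "r") as f:
        config = json.load(f)
    config.setdefault("TRAIN", {}).update(overrides)
    with open(config_path, "w") as f:
        json.dump(config, f, indent=2)
    return config


# Demonstrate on a temporary file rather than the project's config.json.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "config.json")
    demo_config = {"TRAIN": {"NUM_EPOCHS": 10,
                             "TRAIN_BATCH_SIZE": 64,
                             "WARMUP_STEPS": 1000}}
    with open(demo_path, "w") as f:
        json.dump(demo_config, f)

    updated = update_train_params(demo_path, NUM_EPOCHS=3, TRAIN_BATCH_SIZE=32)

    with open(demo_path, "r") as f:
        on_disk = json.load(f)

print(updated["TRAIN"])
```

Keys that are not overridden, such as `WARMUP_STEPS` here, keep their existing values.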

## Reference

More information on `SentenceTransformers` can be found in the [SentenceTransformers documentation](https://www.sbert.net/index.html).