In [None]:
# default_exp neural_search

# Neural Search Engine with transformers

> This tutorial demonstrates how to create a search engine with transformers.

In [None]:
#hide
from nbdev.showdoc import *

The diagram below shows the architecture of the system:

<img src="images/semantic_search_diagram.png" width="600" height="500" />

In [None]:
#export
import pandas as pd
import numpy as np
import faiss
import torch

from sentence_transformers import SentenceTransformer

## Step 0: Collect a dataset
I use a dataset of startup companies. The dataset is stored as JSON, one record per line, and each record includes the company name, a paragraph describing the company, its location, and a picture. The dataset is available at [this link](https://storage.googleapis.com/generall-shared-data/startups_demo.json).

In [None]:
df = pd.read_json("https://storage.googleapis.com/generall-shared-data/startups_demo.json", lines=True)

In [None]:
df.head(3)

Unnamed: 0,name,images,alt,description,link,city
0,SaferCodes,https://safer.codes/img/brand/logo-icon.png,SaferCodes Logo QR codes generator system form...,QR codes systems for COVID-19.\nSimple tools f...,https://safer.codes,Chicago
1,Human Practice,https://d1qb2nb5cznatu.cloudfront.net/startups...,Human Practice - health care information tech...,Point-of-care word of mouth\nPreferral is a mo...,http://humanpractice.com,Chicago
2,StyleSeek,https://d1qb2nb5cznatu.cloudfront.net/startups...,StyleSeek - e-commerce fashion mass customiza...,Personalized e-commerce for lifestyle products...,http://styleseek.com,Chicago


We build the search over the `description` column only: given a *search query*, we find the companies whose *descriptions* are most similar to it.

In [None]:
corpus = df.description.tolist()
print(f"Total number of documents: {len(corpus)}")

Total number of documents: 40474


> Note: Depending on the quality of the documents, we may need to perform some preprocessing such as removing special characters, digits, etc.
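
As a rough illustration of the kind of preprocessing the note refers to, here is a minimal sketch using only the standard library. The helper name `clean_text`, the set of characters it keeps, and the sample sentence are assumptions made for this example. Whether such cleaning helps at all depends on the corpus, since transformer models often cope well with raw text.

```python
import re

def clean_text(text: str) -> str:
    """Hypothetical helper: drop special characters and digits, collapse whitespace."""
    text = re.sub(r"[^\w\s.,!?'-]", " ", text)  # replace special characters with spaces
    text = re.sub(r"\d+", " ", text)            # replace runs of digits with a space
    return re.sub(r"\s+", " ", text).strip()    # collapse whitespace, including newlines

# Made-up sample text, not taken from the dataset
sample = "Simple tools for bars & restaurants.\nFounded in 2020 @ Chicago"
cleaned = clean_text(sample)
print(cleaned)
```

If this kind of cleaning turned out to be useful, it could be applied to every document with `[clean_text(doc) for doc in corpus]` before encoding.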

## Step 1: Create dense vectors of documents (i.e. document embeddings)
We need an embedding model to turn our text documents into embeddings. We use a pretrained model from [Sentence Transformers](https://www.sbert.net/docs/pretrained_models.html), specifically `all-distilroberta-v1`, which works well for semantic search applications.

In [None]:
# Instantiate the model. Use the GPU if one is available, otherwise fall back to the CPU.
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = SentenceTransformer('sentence-transformers/all-distilroberta-v1', device=device)

# convert documents into embeddings
corpus_embeddings = model.encode(corpus, show_progress_bar=True)

Batches:   0%|          | 0/1265 [00:00<?, ?it/s]
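
The progress bar counts batches rather than documents. As a quick sanity check, assuming `encode` was called with its default `batch_size` of 32 (the call above does not override it), the number of batches follows from the corpus size:

```python
import math

num_documents = 40474  # corpus size printed earlier
batch_size = 32        # default `batch_size` of SentenceTransformer.encode
num_batches = math.ceil(num_documents / batch_size)
print(num_batches)
```

A larger `batch_size` may speed up encoding on a GPU with enough memory, at the cost of higher memory use.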

## Step 2: Store the embeddings or index the embeddings
To search the documents, we need to store their embeddings in a document store; in other words, we have to `index` them. There are several ways to do that; here I use Faiss.
Faiss can search through billions of vectors efficiently. For complete information about Faiss, see its [wiki](https://github.com/facebookresearch/faiss/wiki) or read the [paper](https://arxiv.org/abs/1702.08734).

Faiss is built around the `Index` object, which contains searchable vectors. Faiss handles collections of vectors of a fixed dimensionality `d`, typically from a few tens to a few hundred.

> Faiss uses only 32-bit floating point matrices. This means we will have to change the data type of the input before building the index.

In [None]:
# Convert the embeddings to a float32 NumPy array, as required by Faiss.
corpus_embeddings = np.asarray(corpus_embeddings, dtype="float32")

In [None]:
# Build the index
# Shape of embeddings is (40474, 768), so we set the dimension of index to 768.
index = faiss.IndexFlatL2(corpus_embeddings.shape[1])

# Add the document vectors into the index
index.add(corpus_embeddings)
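
`IndexFlatL2` performs exact (brute-force) search and ranks vectors by squared Euclidean distance. To make that concrete, here is a NumPy-only sketch of the same idea on toy data. The function name `flat_l2_search` and the toy vectors are assumptions for illustration; this is not how Faiss is implemented internally.

```python
import numpy as np

def flat_l2_search(vectors, queries, k):
    """Exact k-nearest-neighbour search by squared Euclidean distance (illustrative)."""
    # Pairwise squared distances via ||q - v||^2 = ||q||^2 - 2 q.v + ||v||^2
    q_norms = (queries ** 2).sum(axis=1, keepdims=True)        # shape (n, 1)
    v_norms = (vectors ** 2).sum(axis=1)                       # shape (m,)
    sq_dists = q_norms - 2.0 * queries @ vectors.T + v_norms   # shape (n, m)
    ids = np.argsort(sq_dists, axis=1)[:, :k]                  # k closest per query
    distances = np.take_along_axis(sq_dists, ids, axis=1)
    return distances, ids

# Toy 2-dimensional "embeddings" and a single query (made-up values)
toy_vectors = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]], dtype="float32")
toy_query = np.array([[0.9, 0.1]], dtype="float32")
toy_distances, toy_ids = flat_l2_search(toy_vectors, toy_query, k=2)
print(toy_distances, toy_ids)
```

Because every query is compared against every stored vector, the cost grows linearly with the size of the corpus. That is acceptable for about 40k documents; for much larger collections Faiss offers approximate index types.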

## Step 3: Embed the search query
Before we search for a query, we must convert the search query into an embedding using the same model we used for document embeddings.

In [None]:
search_query = "High-Quality Men's Accessories"

# Embed the query
query_embedding = model.encode([search_query])

## Step 4: Perform the search

In [None]:
# We're interested in top-10 most similar documents
top_k = 10

# index.search returns two arrays of shape (n, k), where n is the number of queries:
# the (squared L2) distances to the nearest neighbors and the ids of those neighbors.
distances, ids = index.search(np.array(query_embedding), k=top_k)
print(distances, ids)

[[0.6489439  0.7083432  0.75166255 0.8019202  0.8154081  0.82068354
  0.8472285  0.86972404 0.8702345  0.87548184]] [[24969 17175  6105 28803 15934  1629 15866 19206  5640 28549]]
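
The distances returned by `IndexFlatL2` are squared L2 distances, so smaller means more similar. If the embeddings have unit length (an assumption here, which can be checked with `np.linalg.norm(corpus_embeddings, axis=1)`), a squared distance `d` maps to a cosine similarity of `1 - d / 2`. The helper name below is hypothetical:

```python
import numpy as np

def squared_l2_to_cosine(sq_dist):
    """Convert squared L2 distances between unit vectors into cosine similarities."""
    return 1.0 - np.asarray(sq_dist) / 2.0

# Sanity check of the identity on two random unit vectors
rng = np.random.default_rng(0)
a, b = rng.normal(size=(2, 768))
a /= np.linalg.norm(a)
b /= np.linalg.norm(b)
assert np.isclose(squared_l2_to_cosine(((a - b) ** 2).sum()), a @ b)

# First three distances from the search output above
top_distances = np.array([0.6489439, 0.7083432, 0.75166255])
print(squared_l2_to_cosine(top_distances))
```

The conversion is only valid for unit-length vectors; for unnormalized embeddings the two measures can rank documents differently.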


## Step 5: Display the search results

In [None]:
# ids[0] holds the results for the first (and only) query
for doc_id in ids[0]:
    print(corpus[doc_id])
    print("===========")

Seasonal Subscription for Men's Accessories
Like looking good but hate shopping for good looking things? Each season, we'll ship you a package of ridiculously cool gear from the hottest up and coming brands.
You'll never again worry about finding hats, belts, shades, or ties that match the stuff already ...
Leather Accessories for Men
Premium leather accessories for men.
Affordable custom-tailored mens clothing online
Fully customizable menswear
celebrity driven menswear that understands the moments in a gentleman's life
designer tailored casual wear and high end luxury suiting and shirting for men
The Manliest Shop on the Web
- Fitness
- Gadgets
- WTF
- Apparel
- Electronics
and more!
All for Men!
Men's Lifestyle Brand
A men's lifestyle brand offering consumers the highest quality products with a fashionable and sophisticated design aesthetic. Gents launched in November 2012 with men's luxury baseball caps, apparel, and accessories, and an e-commerce custom-build experience.
Badass Cu
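
Printing the raw descriptions makes it hard to tell where one hit ends and the next begins. One possible alternative, sketched here with a hypothetical helper `collect_results` and made-up toy inputs, pairs each hit with its rank and distance and shortens the text to a single-line snippet:

```python
def collect_results(corpus, distances, ids, max_chars=80):
    """Hypothetical helper: rank, id, distance and a one-line snippet for each hit."""
    results = []
    for rank, (doc_id, dist) in enumerate(zip(ids[0], distances[0]), start=1):
        snippet = " ".join(corpus[doc_id].split())[:max_chars]  # flatten newlines, truncate
        results.append({"rank": rank, "id": int(doc_id), "distance": float(dist), "text": snippet})
    return results

# Toy inputs with the same layout as `corpus`, `distances` and `ids` (values are made up)
toy_corpus = ["Leather accessories\nfor men.", "QR codes for restaurants.", "Custom menswear online."]
toy_ids = [[0, 2]]
toy_distances = [[0.71, 0.75]]
hits = collect_results(toy_corpus, toy_distances, toy_ids)
for hit in hits:
    print(f"{hit['rank']:>2}. [{hit['distance']:.3f}] {hit['text']}")
```

With the real objects from this notebook the call would be `collect_results(corpus, distances, ids)`.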

In [None]:
#hide
# class FaissNeuralSearch:
#     """ This class represents a neural search that uses faiss as the index for search. """
    
#     def __init__(self, model, corpus):
#         self.model = model
#         self.corpus = corpus