# Semantic Search with Vector DBs

We now understand the essentials of semantic similarity, but how do we scale something like this to millions or even billions of records?

For this we need a *vector database*. A vector database is a framework around something we call a *vector index*. The vector index is a data structure containing vectors like those we created in the previous chapter.

## Approximate Search

For the search through a vector index to be scalable it must use an *approximate nearest neighbors* structure. This structure can exist in many forms, but they all boil down to *approximating* the most similar vectors.

Why do we need to approximate the answer? Well, if we want to search through just one million vectors, using an *exact* search we must perform *one million* comparisons. At 1M items, this is just about doable, but as soon as we scale further it becomes very slow.
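To make the cost of exact search concrete, here is a minimal sketch in plain Python. The random toy vectors and the helper names `cosine` and `exact_search` are assumptions made for illustration; they stand in for real embeddings and are not part of any library.

```python
import math
import random

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def exact_search(query, vectors, top_k=3):
    """Score the query against every vector, then keep the top_k best."""
    scored = [(cosine(query, v), i) for i, v in enumerate(vectors)]
    scored.sort(reverse=True)
    return [(i, score) for score, i in scored[:top_k]]

random.seed(0)
# toy data: 1,000 random 8-dimensional vectors (a stand-in for real embeddings)
vectors = [[random.gauss(0, 1) for _ in range(8)] for _ in range(1000)]
query = vectors[42]

results = exact_search(query, vectors)
# the query's own index (42) ranks first with a score of roughly 1.0
print(results[0])
```

Every query touches all 1,000 vectors here, so the work grows linearly with the size of the index.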

Therefore, we approximate the search, trading a small amount of accuracy for a large reduction in the number of comparisons. This lets us scale to much larger indexes.
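One simple way to picture this is a cluster-based scheme: group the vectors into buckets around a few centroids, then compare the query only against the bucket(s) closest to it. The sketch below is a deliberately simplified illustration of that idea, not a description of how Pinecone or any particular library builds its index. The function names, the random choice of centroids, and the `n_probe` parameter are all assumptions made for this example.

```python
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    norm = dot(v, v) ** 0.5
    return [x / norm for x in v]

def build_buckets(vectors, n_buckets, seed=0):
    """Pick random vectors as centroids and assign every vector to its closest one."""
    rng = random.Random(seed)
    centroids = rng.sample(vectors, n_buckets)
    buckets = [[] for _ in range(n_buckets)]
    for i, v in enumerate(vectors):
        best = max(range(n_buckets), key=lambda c: dot(v, centroids[c]))
        buckets[best].append(i)
    return centroids, buckets

def approx_search(query, vectors, centroids, buckets, n_probe=1):
    """Search only the n_probe buckets whose centroids are closest to the query."""
    ranked = sorted(
        range(len(centroids)),
        key=lambda c: dot(query, centroids[c]),
        reverse=True,
    )
    candidates = [i for c in ranked[:n_probe] for i in buckets[c]]
    best = max(candidates, key=lambda i: dot(query, vectors[i]))
    return best, len(candidates)

rng = random.Random(0)
# toy data: 1,000 unit-length random vectors (a stand-in for real embeddings)
vectors = [normalize([rng.gauss(0, 1) for _ in range(8)]) for _ in range(1000)]
centroids, buckets = build_buckets(vectors, n_buckets=10)

best, n_compared = approx_search(vectors[7], vectors, centroids, buckets)
print(best, n_compared)
```

The query is compared against the 10 centroids plus the members of one bucket rather than all 1,000 vectors. The price is that a true nearest neighbor sitting in a different bucket can be missed, which is why the result is only approximate.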

<img src="https://github.com/jamescalam/applied-ml-minicourse/raw/main/images/vec-db-scale.png" style="width:70%">

Once we have an approximate vector index, there's still more needed to create a vector database.

## Vector Databases

A good vector database provides data management capabilities such as record insertion, deletion, and updates. It should also let us attach *metadata* to vectors and filter the search space based on that metadata. Ideally, we wouldn't want to manage all of this ourselves, and fortunately there are services that do this for us. We will be using the Pinecone vector database for this.
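To illustrate what these capabilities mean in practice, here is a toy in-memory store. It is a hypothetical stand-in written for this example (the class `ToyVectorStore` and its method names are assumptions, not Pinecone's API), and it uses exact search to keep the sketch short.

```python
class ToyVectorStore:
    """In-memory stand-in for a vector database (hypothetical, for illustration only)."""

    def __init__(self):
        self.records = {}  # id -> (vector, metadata)

    def upsert(self, id_, vector, metadata=None):
        # insert a new record or overwrite an existing one
        self.records[id_] = (vector, metadata or {})

    def delete(self, id_):
        # remove a record if it exists
        self.records.pop(id_, None)

    def query(self, vector, top_k=3, metadata_filter=None):
        # keep only records whose metadata matches every key in the filter
        metadata_filter = metadata_filter or {}
        candidates = [
            (id_, vec) for id_, (vec, meta) in self.records.items()
            if all(meta.get(k) == v for k, v in metadata_filter.items())
        ]
        # score the remaining records by dot product
        scored = [
            (sum(a * b for a, b in zip(vector, vec)), id_)
            for id_, vec in candidates
        ]
        scored.sort(reverse=True)
        return [(id_, score) for score, id_ in scored[:top_k]]

store = ToyVectorStore()
store.upsert('a', [1.0, 0.0], {'title': 'physics'})
store.upsert('b', [0.9, 0.1], {'title': 'history'})
store.upsert('c', [0.0, 1.0], {'title': 'physics'})
store.delete('c')

print(store.query([1.0, 0.0]))  # [('a', 1.0), ('b', 0.9)]
print(store.query([1.0, 0.0], metadata_filter={'title': 'history'}))  # [('b', 0.9)]
```

The filter shrinks the candidate set before any similarity is computed, which is the sense in which metadata filtering restricts the search space.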

Using Pinecone we are able to store and search through up to 5M vectors on their free tier, more than enough for most use cases (including ours).

## Implementation

For this chapter we will download a question-answering dataset, encode it with a `sentence-transformers` model, and index and search with the Pinecone vector database. Let's begin by installing the prerequisites.

In [None]:
!pip install -qq datasets pinecone-client sentence-transformers tqdm --extra-index-url https://download.pytorch.org/whl/cu113 torch

### Dataset

Before creating the vector database we need a dataset that we will encode and index. We will use the ***S**tanford **Qu**estion **A**nswering **D**ataset* via another Hugging Face library called `datasets`.

We download the dataset like so:

In [1]:
from datasets import load_dataset

squad = load_dataset('squad', split='train')
squad

Dataset({
    features: ['id', 'title', 'context', 'question', 'answers'],
    num_rows: 87599
})

In [2]:
squad[0]

{'id': '5733be284776f41900661182',
 'title': 'University_of_Notre_Dame',
 'context': 'Architecturally, the school has a Catholic character. Atop the Main Building\'s gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.',
 'question': 'To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?',
 'answers': {'text': ['Saint Bernadette Soubirous'], 'answer_start': [515]}}

The dataset contains many `(question, context)` pairs. Using an encoder model trained for question-answering we will be able to encode questions and their relevant contexts into the same vector space, with each question landing close to the contexts that answer it. Because questions are much shorter than the contexts that answer them, this type of *semantic search* is known as ***asymmetric** semantic search*.

SQuAD contains duplicate contexts as each context can answer multiple questions, so we first deduplicate our contexts.

In [3]:
contexts = list(set(squad['context']))
len(contexts)

18891

We now have just ~19K contexts to index. The next step is encoding these contexts with a suitable sentence transformer.
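One caveat with `set` is that it does not guarantee any particular order, and for strings the order can change from one Python session to the next. If we later derive IDs from each context's position in the list, a stable order is helpful. A minimal sketch of an order-preserving alternative, using a small made-up list in place of `squad['context']`:

```python
# toy stand-in for squad['context']: the same passage appears once per question
raw_contexts = ['passage A', 'passage A', 'passage B', 'passage A', 'passage C']

# dict keys are unique and keep insertion order (Python 3.7+)
contexts_ordered = list(dict.fromkeys(raw_contexts))
print(contexts_ordered)  # ['passage A', 'passage B', 'passage C']
```

The number of unique contexts is the same either way; only the ordering becomes reproducible.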

### Embedding Model

As mentioned, we must use a question-answering (or QA) encoder model to build our vectors. To do this we use the [`multi-qa-MiniLM-L6-cos-v1`](https://huggingface.co/sentence-transformers/multi-qa-MiniLM-L6-cos-v1) model. The `multi-qa` prefix indicates that it was trained on multiple question-answering datasets, which makes it suitable for QA (it does not mean the model is multilingual), and the `cos` suffix indicates that it should be used with cosine similarity.

In [6]:
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('multi-qa-MiniLM-L6-cos-v1')
model

SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
  (2): Normalize()
)
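The final `Normalize()` layer scales every output vector to unit length. For unit-length vectors the dot product and cosine similarity give the same value, which can be checked with a small sketch in plain Python (the helper names and the two example vectors are made up for illustration):

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

a, b = [3.0, 4.0], [1.0, 2.0]
a_n, b_n = normalize(a), normalize(b)

# after normalization, the plain dot product matches the cosine similarity
print(math.isclose(dot(a_n, b_n), cosine(a, b)))  # True
```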

As we'll be encoding a lot of contexts we should switch the model to use a GPU device if possible. This will speed up the process significantly.

In [8]:
import torch

device = 'cuda' if torch.cuda.is_available() else 'cpu'
model.to(device)

print(device)

cpu


As before, we can encode a context to create a *context vector* like so:

In [7]:
xc = model.encode([contexts[0]])
xc.shape

(1, 384)

This outputs a *384*-dimensional vector. As we encode our contexts we will insert the resulting vectors into our vector database, so we must now initialize it.

### Vector Database

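We must sign up for a free API key from Pinecone at [app.pinecone.io](https://app.pinecone.io/). Once you have the API key, insert it below to initialize the connection to Pinecone. Note that `pinecone.init` belongs to the older 2.x releases of `pinecone-client`; later releases replaced it with a client class, so the cells in this section assume a 2.x install.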

In [None]:
import pinecone

pinecone.init(
    api_key='<<YOUR_API_KEY>>',
    environment='us-west1-gcp'
)

Then we create a single vector index using the parameters required by our model.

In [None]:
index_name = 'squad-demo'

pinecone.create_index(
    index_name,
    dimension=model.get_sentence_embedding_dimension(),
    metric='cosine'
)

And we then connect to the newly created index like so:

In [None]:
index = pinecone.Index(index_name)

# view index stats
index.describe_index_stats()

With that we're ready to begin encoding and inserting all of our contexts.

### Adding the Contexts

To add the contexts we will work through everything in batches of `64`. We encode a batch of 64 contexts, insert those 64 context vectors into Pinecone, and move on to the next batch.
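The batch boundaries come from stepping through the list in strides of the batch size and clipping the final batch at the end of the list. A minimal sketch of that pattern on its own (the helper name `batch_bounds` is an assumption for this example):

```python
def batch_bounds(n_items, batch_size):
    """Yield (start, end) index pairs that cover range(n_items) in batches."""
    for start in range(0, n_items, batch_size):
        # the last batch may be shorter than batch_size
        yield start, min(start + batch_size, n_items)

print(list(batch_bounds(10, 4)))  # [(0, 4), (4, 8), (8, 10)]

# with our ~19K contexts and a batch size of 64
bounds = list(batch_bounds(18891, 64))
print(len(bounds), bounds[-1])  # 296 (18880, 18891)
```

So the 18,891 contexts are processed in 296 batches, the last of which holds only 11 items.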

In [None]:
from tqdm.auto import tqdm

batch_size = 64

# map each context to its article title; the deduplicated `contexts`
# list no longer lines up row-for-row with `squad`
context_to_title = dict(zip(squad['context'], squad['title']))

for i in tqdm(range(0, len(contexts), batch_size)):
    # get index for end of batch
    i_end = min(i + batch_size, len(contexts))
    batch = contexts[i:i_end]
    # create context vectors
    vecs = model.encode(batch).tolist()
    # make unique IDs for each context vector
    ids = [str(n) for n in range(i, i_end)]
    # add context text and its matching title to metadata
    metadata = [
        {'context': c, 'title': context_to_title[c]} for c in batch
    ]
    # create list of (id, vector, metadata) tuples to add to index
    to_upsert = list(zip(ids, vecs, metadata))
    index.upsert(vectors=to_upsert)

# view index stats
index.describe_index_stats()

We've now indexed all of our contexts and we can move on to *querying*.

### Querying

The query process is almost identical to the context encoding process. We use the QA model to encode a query, then use that to search through the vector database.

In [None]:
query = "why is Albert Einstein famous?"

xq = model.encode(query).tolist()

Then we query with `index.query`. We will find the top `5` most similar matches using the `top_k` parameter, and return the related metadata with `include_metadata=True`.

In [9]:
matches = index.query(xq, top_k=5, include_metadata=True)
matches

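With `include_metadata=True`, each entry under the response's `matches` key carries an `id`, a similarity `score`, and the `metadata` we stored at upsert time. The sketch below shows one way to flatten such a response into rows. The helper `summarize_matches` is hypothetical, and the `placeholder` response is a made-up dictionary that only mimics the expected shape; it is not real output from Pinecone.

```python
def summarize_matches(response, max_chars=60):
    """Flatten a query response into (id, score, title, context preview) rows."""
    rows = []
    for match in response['matches']:
        meta = match['metadata']
        rows.append((
            match['id'],
            match['score'],
            meta['title'],
            meta['context'][:max_chars],  # truncate long contexts for display
        ))
    return rows

# made-up placeholder with the same shape as a query response (NOT real results)
placeholder = {
    'matches': [
        {
            'id': '0',
            'score': 0.5,
            'metadata': {'title': 'Example_Title', 'context': 'Example context text.'},
        },
    ]
}

for row in summarize_matches(placeholder):
    print(row)
```

Applied to the real `matches` object, the same loop would list the five contexts closest to the query, ordered from the highest score down.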