# Do you really need semantic search? Part I

## Semantic search vs. lexical search with query expansion

For those of us whose path to machine learning is not rooted in more traditional computer science, it is easy to ignore the fact that long before text embeddings and the resulting similarity search became a thing, the field of information retrieval was not only alive and well, but rather mature. Or that even if on the surface the choice between semantic and lexical (i.e. keyword-based) search seems like a no-brainer, the question of whether one is consistently more performant than the other appears to be an open one.

I do not claim to have arrived at a conclusion, but here is the notebook that I used to run a few experiments that you are welcome to reproduce or tweak however you see fit. Namely, below I:

* Put together a basic retriever for a RAG (Retrieval Augmented Generation) system that queries one's documents.

* Test the retriever in two different scenarios: semantic search, and lexical search with LLM-powered query expansion. Here the idea is that one of the main advantages of semantic search over its keyword-based counterpart is that we don't have to rely on the exact word matches between the query and the retrieved documents. But, as great as that is, there is another tried-and-tested approach aimed at the same goal: expanding the query by adding related words to it. (And unlike in the *old times*, we can now use LLMs to get the job done.)

> I provide just one example below, but I have tried a handful of queries on my own data, and I must say, I did not see semantic search outperform BM25 (a popular keyword search algorithm) once the query had been expanded.

* To get some hard numbers, in Part II that is to follow, I also benchmark different retrieval approaches (semantic vs. hybrid vs. lexical, with and without query expansion) on several open source datasets.

## 0. Setup

First we are going to install a few packages: namely, **LangChain** for loading and chunking up the documents, **PyPDF** in case the said documents are PDFs, **sentence-transformers** for the semantic embeddings, **pinecone-client** and **pinecone-text** since I am using Pinecone for the vector storage, as well as their hybrid (semantic + keyword-based) search, and **openai** for the LLM that will perform the query expansion.

(FYI I ran these experiments in Google Colab, so a number of packages I am using came preinstalled.)

In [None]:
!pip install -U langchain pypdf sentence-transformers pinecone-client pinecone-text openai


Now the imports:

In [None]:
import torch

from google.colab import drive

from langchain_community.document_loaders import DirectoryLoader
from langchain.text_splitter import SentenceTransformersTokenTextSplitter
from langchain_community.document_loaders import PyPDFLoader

from pinecone import Pinecone, PodSpec
from pinecone_text.sparse import BM25Encoder

from tqdm.auto import tqdm

from sentence_transformers import SentenceTransformer

## 1. Data Indexing

Our first step in building the data retriever is indexing the data. Before we do that, the data has to be loaded. Since I am already using Google Colab, I've put the documents I will be querying in a folder on my Google Drive. Let's mount it:

In [None]:
drive.mount('/content/gdrive')

Mounted at /content/gdrive


### 1.1 Loading and chunking

Let's go with a common use case: PDF files. Parsing text from PDFs can be a nontrivial task in itself, but let's say that our PDFs are reasonably nice - no scanned coffee-stained pages that one would need to preprocess and do OCR on - and that the information that we are interested in querying appears as unstructured text, rather than images or, say, tables. Personally I filled my directory with arXiv preprints. The only kind of additional processing that I'll do on these files is taking care of the most common *ligatures* - special characters that are combinations of other, less special, characters.

**LangChain** is an open-source framework for building LLM-powered applications. Here I am only going to use it for loading and slicing up the data into chunks that will then be indexed. I will use the *DirectoryLoader* and set the *loader_cls* argument to *PyPDFLoader* because I like to keep track of the PDF pages that each chunk came from (otherwise *DirectoryLoader*'s default *UnstructuredFileLoader* would work just fine). The *load_pdfs(dir_name)* function thus returns a list of *langchain_core.documents.base.Document* objects - one *Document* per page for all the PDFs in the *dir_name* directory. Each *Document* has *page_content* (in the form of a string) and *metadata*, a dictionary with *source* (the original filename) and *page*.


In [None]:
def decompose_ligatures(text):
    # map the most common Latin ligatures to their component letters
    ligatures = {
        "ﬀ": "ff",
        "ﬁ": "fi",
        "ﬂ": "fl",
        "ﬃ": "ffi",
        "ﬄ": "ffl",
    }

    for search, replace in ligatures.items():
        text = text.replace(search, replace)

    return text


def load_pdfs(dir_name):
    # one Document per PDF page, for every PDF under dir_name
    loader = DirectoryLoader(dir_name, glob='**/*.pdf', loader_cls=PyPDFLoader, show_progress=True)
    pdfs = loader.load()

    for pdf in pdfs:
        pdf.page_content = decompose_ligatures(pdf.page_content)

    return pdfs

pages = load_pdfs('/content/gdrive/MyDrive/test_RAG')
len(pages)

100%|██████████| 3/3 [00:04<00:00,  1.44s/it]


31
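
A possible alternative to the hand-written table in *decompose_ligatures* is Unicode compatibility normalization, which Python's standard library exposes through *unicodedata.normalize*. The sketch below is not what this notebook uses, and *decompose_ligatures_nfkc* is a hypothetical name. Keep in mind that NFKC rewrites more than ligatures (a superscript 2 becomes a plain 2, for example), which may or may not be acceptable for papers full of formulas.

```python
import unicodedata


def decompose_ligatures_nfkc(text: str) -> str:
    # NFKC applies compatibility decomposition followed by recomposition;
    # Latin ligatures such as U+FB01 are mapped to their component letters.
    return unicodedata.normalize("NFKC", text)


sample = "e\ufb03cient \ufb01eld \ufb02ow"   # contains the ffi, fi and fl ligatures
print(decompose_ligatures_nfkc(sample))       # efficient field flow
print(decompose_ligatures_nfkc("x\u00b2"))    # x2 (side effect: the superscript is flattened)
```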

The next step is to embed the documents into fixed-size vectors. Before we do that though, we will split our pages up into smaller chunks of text - this should help the embedding model capture the meaning of each chunk more accurately, as well as (hopefully) retrieve more relevant pieces of text for our RAG system.

A little sidenote: the chunking strategy can have a substantial effect on the retriever's performance. Ideally you would want to avoid splitting up text that, semantically, should be kept together: e.g. Python functions if the text in question is code, or, perhaps, paragraphs of text (unless the paragraph is long, in which case you'd want to split it). Basically, keep trying until you arrive at a solution that works. When text is automatically split into chunks of a given size, one often factors in some overlap between neighboring chunks, to minimize the information lost by breaking up passages.
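
To make the overlap idea concrete, here is a minimal sketch of fixed-size chunking with overlap. It is purely illustrative: *chunk_with_overlap* is a hypothetical helper, and whitespace-separated words stand in for the model tokens a real splitter would count.

```python
def chunk_with_overlap(tokens, chunk_size, overlap):
    """Split a token list into windows of `chunk_size` sharing `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # stop once the current window has reached the end of the text
        if start + chunk_size >= len(tokens):
            break
    return chunks


words = "the quick brown fox jumps over the lazy dog again".split()
for chunk in chunk_with_overlap(words, chunk_size=4, overlap=1):
    print(chunk)
```

Each chunk starts with the last word of the previous one, so a sentence cut at a chunk boundary still appears with some of its context on both sides.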

Now let me ignore my own advice and go with the simplest possible chunking option (which nevertheless turns out to be sufficient for our purposes).

Since I already decided I'll be using **SentenceTransformers** for the document embeddings, I will also make use of LangChain's *SentenceTransformersTokenTextSplitter*. You can pass one of the SentenceTransformers model names (e.g. *'paraphrase-MiniLM-L6-v2'*, which results in 384-dimensional embedding vectors) as an argument, and the text splitter will automatically produce chunks that fit the token window of that particular model.

In [None]:
def split_pdfs(pdfs):
    text_splitter = SentenceTransformersTokenTextSplitter(model_name='paraphrase-MiniLM-L6-v2')
    chunks = text_splitter.split_documents(pdfs)
    return chunks

chunks = split_pdfs(pages)
len(chunks)


438

Just like *pages*, *chunks* are LangChain *Document* objects with the same attributes - except that you will presumably have more of them, assuming the chunk size is smaller than the original pages. While we are at it, I will extract the text from the chunks into a separate list called *corpus* that I will use to initialize the encoder for the lexical search:

In [None]:
corpus = [chunk.page_content for chunk in chunks]

bm25 = BM25Encoder()
bm25.fit(corpus)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


  0%|          | 0/438 [00:00<?, ?it/s]

<pinecone_text.sparse.bm25_encoder.BM25Encoder at 0x7d92331fb130>

### 1.2 Embeddings: semantic (dense) and lexical (sparse)

**BM25** stands for "Best Match 25". It is an improvement over the basic TF-IDF (term frequency - inverse document frequency) search, which rewards documents that contain more occurrences of the query terms, but down-weights terms that appear in a lot of documents in the text corpus (basically, you don't want the search results to be thrown off by words like "the", "it", "an", etc., hence the inverse document frequency bit). The latter is the reason why we had to start by fitting the lexical search encoder to the text corpus.

BM25 adds two parameters into the mix: one that introduces a saturation curve - a point beyond which increasing the number of term occurrences results in diminishing returns - and a normalization factor for the document's length (i.e. do you want to reward longer or more concise documents). The two parameters (*k1* and *b*) can be set to your preferred values when instantiating the *BM25Encoder* above; otherwise they fall back to the library defaults. (If you would rather not fit the encoder on your own corpus, pinecone-text also provides *BM25Encoder.default()*, which loads term statistics pre-fitted on the [MS MARCO dataset](https://microsoft.github.io/msmarco/).)
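
To see the saturation and length-normalization effects in numbers, here is a sketch of the textbook Okapi BM25 scoring function. It is an illustration only: pinecone-text's *BM25Encoder* splits the computation between document and query vectors and may differ in details, *bm25_scores* is a hypothetical helper, and the values k1=1.2 and b=0.75 are common choices rather than anything tuned here.

```python
import math
from collections import Counter


def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score each tokenized document in `docs` against `query_terms`."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # document frequency: in how many documents does each term appear?
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            # rare terms get a higher inverse document frequency
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1)
            # b controls how strongly longer-than-average documents are penalized
            length_norm = 1 - b + b * len(d) / avg_len
            # k1 controls how quickly repeated occurrences stop adding to the score
            score += idf * tf[term] * (k1 + 1) / (tf[term] + k1 * length_norm)
        scores.append(score)
    return scores


docs = [
    "fractons cannot move alone".split(),
    "fractons fractons fractons move in pairs".split(),
    "the model is defined on a cubic lattice".split(),
]
print(bm25_scores(["fractons", "move"], docs))
```

The third document shares no terms with the query and scores zero; the second one mentions "fractons" three times, but because of the saturation term that contributes far less than three times the score of a single mention.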

Speaking of encoders, time to get that SentenceTransformer out!

In [None]:
device = 'cuda' if torch.cuda.is_available() else 'cpu'

encoder = SentenceTransformer('paraphrase-MiniLM-L6-v2', device=device)
encoder

SentenceTransformer(
  (0): Transformer({'max_seq_length': 128, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)

I will be uploading the embedded chunks straight to Pinecone vector storage - mostly because (a) it's a fully managed persistence solution and their free tier is more than enough for what we are doing here, and (b) it already has hybrid search implemented. If you haven't done it yet, you'll need to create an account and note down your API Key. The free tier currently allows for a single index, so if you end up modifying things later on and reuploading the embeddings, you might want to run the *pc.delete_index("test")* cell below.

In [None]:
pc = Pinecone(api_key="...")

index_name = "test"

# only create index if it doesn't exist
if index_name not in pc.list_indexes().names():
    pc.create_index(
        name=index_name,
        dimension=encoder.get_sentence_embedding_dimension(),
        metric='dotproduct',  # set the metric to dotproduct if you intend to run hybrid search
        spec=PodSpec(environment="gcp-starter")
    )

# now connect to the index
vector_store = pc.Index(index_name)

In [None]:
# in case you decide to start over:
pc.delete_index(index_name)

For each chunk we will now produce two vectors: a **dense** 384-dimensional semantic embedding, and a **sparse** embedding that assigns a BM25 weight to each of the tokens that show up in the chunk.

In [None]:
dense = encoder.encode(corpus[20])
sparse = bm25.encode_documents(corpus[20])
print(corpus[20])
print(dense[:5])
print(sparse['indices'][:5])
print(sparse['values'][:5])

plaquette terms [UNK] ( z ) and [UNK] ( z + 1 ) centered at the same xycoordinates on adjacent layers atzandz + 1 respectively as shown in fig. 1 ( b - c ). mul - tiplying the two operators leads to an eight - spin operator [UNK] = [UNK] ( z ) [UNK] ( z + 1 ) = [UNK], ( 2 ) associated with a cube, as shown in fig. 1 ( a ). all such cubic terms commute with each other. the hamiltonian of our three dimensional model is simply the sum of all [UNK] taken with a global minus sign : h =
[-0.6019988   0.00850145 -0.23062377  0.09835535 -0.5358248 ]
[4001694969, 3540170031, 1968249149, 3254163991, 2484513939]
[0.4461530414770647, 0.6170205070707802, 0.8493719683068038, 0.76315730613523, 0.8011044535021319]


For each chunk, we will upload these two vectors, along with the chunk's content (the text) and its metadata, to our Pinecone index:

In [None]:
batch_size = 128

for i in tqdm(range(0, len(chunks), batch_size)):
    # find end of batch
    i_end = min(i+batch_size, len(chunks))
    # create IDs batch
    ids = [str(x) for x in range(i, i_end)]
    # create metadata batch
    metadatas = [{'text': chunk.page_content, 'source': chunk.metadata['source'], 'page': chunk.metadata['page']} for chunk in chunks[i:i_end]]
    # create embeddings
    contents = [chunk.page_content for chunk in chunks[i:i_end]]
    dense_vectors = encoder.encode(contents)
    sparse_vectors = bm25.encode_documents(contents)

    vectors = []
    for _id, sparse, dense, metadata in zip(
        ids, sparse_vectors, dense_vectors, metadatas
    ):

        vectors.append({
            'id': _id,
            'sparse_values': sparse,
            'values': dense,
            'metadata': metadata
        })

    vector_store.upsert(vectors)



# check number of records in the index
vector_store.describe_index_stats()

  0%|          | 0/4 [00:00<?, ?it/s]

{'dimension': 384,
 'index_fullness': 0.00256,
 'namespaces': {'': {'vector_count': 256}},
 'total_vector_count': 256}

If the *total_vector_count* for the Pinecone index above is not yet equal to the number of chunks, it might just need a moment to get updated.

In [None]:
vector_store.describe_index_stats()

{'dimension': 384,
 'index_fullness': 0.00438,
 'namespaces': {'': {'vector_count': 438}},
 'total_vector_count': 438}

Here we go.

## 2. Data retrieval

Now that our vector store is ready, we can submit some queries and see where that gets us. We might as well use hybrid search from the get-go, with a scaling factor that lets us continuously move between semantic and lexical search (I borrowed the code for this [here](https://www.pinecone.io/learn/hybrid-search-intro/)). The scaling factor *alpha* ranges from 0 (pure lexical) to 1 (pure semantic).

In [None]:
def hybrid_scale(dense, sparse, alpha: float):
    # check alpha value is in range
    if alpha < 0 or alpha > 1:
        raise ValueError("Alpha must be between 0 and 1")
    # scale sparse and dense vectors to create hybrid search vecs
    hsparse = {
        'indices': sparse['indices'],
        'values':  [v * (1 - alpha) for v in sparse['values']]
    }
    hdense = [v * alpha for v in dense]
    return hdense, hsparse
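
To get a feel for what *alpha* does, the toy sketch below computes the score a document would receive, under the assumption that a hybrid index with the dotproduct metric scores a document as the dense dot product plus the sparse dot product. Scaling the query the way *hybrid_scale* does then amounts to a weighted average of the two similarities. The helpers *sparse_dot* and *hybrid_score* and all the numbers are made up for illustration.

```python
def sparse_dot(a, b):
    """Dot product of two sparse vectors given as {'indices': [...], 'values': [...]}."""
    b_lookup = dict(zip(b['indices'], b['values']))
    return sum(v * b_lookup.get(i, 0.0) for i, v in zip(a['indices'], a['values']))


def hybrid_score(q_dense, q_sparse, d_dense, d_sparse, alpha):
    # multiplying the dense query by alpha and the sparse query by (1 - alpha)
    # weights the two dot products accordingly
    dense_sim = sum(q * d for q, d in zip(q_dense, d_dense))
    return alpha * dense_sim + (1 - alpha) * sparse_dot(q_sparse, d_sparse)


# toy query and two toy documents
q_dense, q_sparse = [1.0, 0.0], {'indices': [7], 'values': [1.0]}
doc_a = ([0.9, 0.1], {'indices': [3], 'values': [0.8]})  # close in meaning, no shared term
doc_b = ([0.2, 0.9], {'indices': [7], 'values': [0.8]})  # shares a term, far in meaning

for alpha in (0.0, 0.5, 1.0):
    a = hybrid_score(q_dense, q_sparse, *doc_a, alpha)
    b = hybrid_score(q_dense, q_sparse, *doc_b, alpha)
    print(f"alpha={alpha}: doc_a={a:.2f}, doc_b={b:.2f}")
```

At *alpha=0* only the keyword match counts, so doc_b wins; at *alpha=1* only the embedding similarity counts, so doc_a wins.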

Let's go ahead and ask some questions! One of the papers I am using for setting up this experiment dates back to my days as a quantum physicist. The subject is fractons - quasiparticles that may appear in certain quantum systems exhibiting topological order (if this tells you nothing, don't worry about it). One of their defining characteristics is that an isolated fracton cannot be moved in space. What happens if we ask our Pinecone database about it?

In [None]:
query = 'Can you move an isolated fracton'

# In order to do a similarity search among the embedding vectors,
# you first need to embed the query in the same vector space:
dense = encoder.encode(query).tolist()

results = vector_store.query(top_k=5, vector=dense, include_metadata=True)

print(results)

{'matches': [{'id': '46',
              'metadata': {'page': 3.0,
                           'source': '/content/gdrive/MyDrive/test_RAG/1709.10094.pdf',
                           'text': 'only connect diagonal plaquettes. vertical '
                                   'pairs of cubes can only be separated along '
                                   '[UNK] z. therefore, all four fractons '
                                   'localized at the corners of the operator '
                                   'with two - dimensional support share the '
                                   'same flavor. it is impossi - ble to alter '
                                   'the position of a single fracton at a time '
                                   'with - out paying an additional energy '
                                   'cost. instead, fractons can only be '
                                   'shifted in pairs : either in two - '
                                   'dimensional xy planes [ fig. 


Yay! The top-ranked passage contains the answer to our question: "it is impossible to alter the position of a single fracton at a time without paying an additional energy cost."

Would we have gotten here if it wasn't for semantic search? We can check by running a hybrid search query (and setting *alpha* to 0).

In [None]:
# Create the sparse encoding for our query
sparse = bm25.encode_queries(query)

# Get the rescaled dense and sparse vectors
dense_vec, sparse_vec = hybrid_scale(dense, sparse, alpha=0)

# search
results = vector_store.query(
    top_k=5,
    vector=dense_vec,
    sparse_vector=sparse_vec,
    include_metadata=True
)

print(results)

{'matches': [{'id': '80',
              'metadata': {'page': 5.0,
                           'source': '/content/gdrive/MyDrive/test_RAG/1709.10094.pdf',
                           'text': '##like particles, fractons, pairs of which '
                                   'can be combined into composite excita - '
                                   'tions that move either in a straight line '
                                   'along the [UNK] zdirec - tion, or freely '
                                   'in the xyplane at a set height z. the abil '
                                   '- ity to combine fractons into mobile '
                                   'particles that move in spaces with reduced '
                                   'dimensionality is common across many '
                                   'fracton models. we find that the presence '
                                   'of zero energy modes on the surfaces '
                                   'perpendicular to [UNK] x

The first passage talks about fractons, but does not answer our question. The second one comes from a different paper altogether, and has no mention of fractons. Finally, the third passage contains information about fractons being "immobile pointlike particles". Not too bad! Still, the semantic search performed better due to its ability to compare the meaning behind passages rather than focusing on the exact token match. What if instead we ask an LLM to expand the query with related words (e.g. synonyms) and run the lexical search again with the expanded query?

I am going to use OpenAI's *gpt-3.5-turbo* for the query expansion, and ask for the output in JSON format. You are, of course, welcome to use an open-source model of your choice, but keep in mind that large models perform best on this sort of zero-shot in-context learning task.

In [None]:
OPEN_AI_KEY = '...'

from openai import OpenAI
client = OpenAI(api_key = OPEN_AI_KEY)

completion = client.chat.completions.create(
  model="gpt-3.5-turbo",
  messages=[
    {"role": "system", "content": """The user is going to give you a search query. First, generate between five and ten alternative \
    queries with the same meaning. Second, output a list of individual words present in these alternative queries, excluding any \
    duplicates or words that were present in the original query. The output should be in JSON format: \
    {
      "alternative queries": [...],
     "terms": [...]
    }"""},
    {"role": "user", "content": query}
  ]
)
print(completion.choices[0].message.content)

{
  "alternative queries": [
    "Is it possible to transport a standalone fracton?",
    "Moving a single fracton, is it feasible?",
    "Can you relocate an independent fracton?",
    "Is it doable to shift a solitary fracton?",
    "Transporting an isolated fracton, how is it done?"
  ],
  "terms": [
    "isolated",
    "standalone",
    "feasible",
    "relocate",
    "shift",
    "solitary",
    "transporting",
    "how",
    "done"
  ]
}


To remind you, the original query was 'Can you move an isolated fracton'. I am seeing "isolated" appear in the list of terms above, even though I asked GPT to keep the original query's words out of it, and I can tell you from past experience that the model can ignore the "no duplicates" requirement as well. Still, it does a reasonable enough job, even though "how" and "done" are probably not very useful as far as query expansion goes. Let's see how this new query does in the field:

In [None]:
import json

# The GPT completion is actually a string that looks like JSON right now,
# so let's start by converting it to a dictionary
completion_to_json = json.loads(completion.choices[0].message.content)

expanded_query = query + ' ' + ' '.join(completion_to_json['terms'])

print(expanded_query)

Can you move an isolated fracton isolated standalone feasible relocate shift solitary transporting how done
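
Note that "isolated" now appears twice in the expanded query, since the model did not fully follow the instructions. If you prefer not to rely on the LLM for this, the filtering can be done after the fact. Below is a minimal sketch; *build_expanded_query* is a hypothetical helper, not part of the notebook. Whether a repeated word actually changes the sparse query weights depends on the encoder, so treat this as a tidiness step rather than a guaranteed improvement.

```python
import re


def build_expanded_query(query, terms):
    """Append `terms` to `query`, skipping repeats and words already in the query."""
    seen = set(re.findall(r"\w+", query.lower()))
    extra = []
    for term in terms:
        key = term.lower()
        if key not in seen:
            seen.add(key)
            extra.append(term)
    return " ".join([query] + extra)


terms = ["isolated", "standalone", "feasible", "relocate", "shift",
         "solitary", "transporting", "how", "done"]
print(build_expanded_query("Can you move an isolated fracton", terms))
```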


In [None]:
# Create the sparse encoding for the expanded query
expanded_sparse = bm25.encode_queries(expanded_query)

dense_vec, sparse_vec = hybrid_scale(dense, expanded_sparse, alpha=0)

# search
new_results = vector_store.query(
    top_k=5,
    vector=dense_vec,
    sparse_vector=sparse_vec,
    include_metadata=True
)

print(new_results)

{'matches': [{'id': '46',
              'metadata': {'page': 3.0,
                           'source': '/content/gdrive/MyDrive/test_RAG/1709.10094.pdf',
                           'text': 'only connect diagonal plaquettes. vertical '
                                   'pairs of cubes can only be separated along '
                                   '[UNK] z. therefore, all four fractons '
                                   'localized at the corners of the operator '
                                   'with two - dimensional support share the '
                                   'same flavor. it is impossi - ble to alter '
                                   'the position of a single fracton at a time '
                                   'with - out paying an additional energy '
                                   'cost. instead, fractons can only be '
                                   'shifted in pairs : either in two - '
                                   'dimensional xy planes [ fig. 

We are back to the top-ranked passage containing the answer! Feel free to play around with this using your own documents - perhaps you too will find that lexical search alone (well, with query expansion) gets you where you need to be.

---

I did not bother with the actual hybrid (meaning, a combination of semantic and lexical) search here because in the example above, there was no need to try to improve the semantic search by adding keywords into the mix. That can easily be done by setting *alpha* to a value strictly between 0 and 1, or by bypassing the *hybrid_scale* function and simply passing the unscaled dense- and sparse-encoded query to Pinecone's *query* method (as far as the ranking goes, this is akin to setting *alpha=0.5*, since scaling both vectors by the same factor does not change the order of the results).
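
A quick sanity check of the *alpha=0.5* remark, assuming again that the hybrid score is the sum of the dense and sparse dot products: halving both parts halves every score, which leaves the order of the results untouched. The similarity values below are random placeholders, not real retrieval scores.

```python
import random

random.seed(0)
# made-up dense and sparse similarities for ten hypothetical documents
dense_sims = [random.random() for _ in range(10)]
sparse_sims = [random.random() for _ in range(10)]

unscaled = [d + s for d, s in zip(dense_sims, sparse_sims)]          # no rescaling at all
half = [0.5 * d + 0.5 * s for d, s in zip(dense_sims, sparse_sims)]  # alpha = 0.5


def ranking(scores):
    # document positions sorted from highest to lowest score
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


print(ranking(unscaled) == ranking(half))
```

The absolute scores differ by a factor of two, so this equivalence only concerns the ranking, not any score thresholds you might apply.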

Now, this was all just exploring whether there was something behind my intuition that dense embeddings could be effectively replaced by expanded queries without much (or perhaps any) loss of performance. Stay tuned for Part II for the benchmarks on actual information retrieval datasets to see if this holds beyond the sandbox :)