# 5 - Information Retrieval
## CSCI E-108
### Steve Elston

## Introduction

Document search is a core data mining methodology. For example, document search is an essential component of solutions like web search, recommender systems and RAG algorithms for constraining results from large language models.  

This notebook contains a simple example of vector document search. The search approach used in this notebook is deliberately restricted. A real-world solution would be considerably more complex. In this case, the search is performed in two simple stages.    
1. An approximate vector search using a scalable index.
2. A reranking or refinement step using a more exact, but slower, algorithm applied to the results of the first search.   

Both of these steps use only a dense text embedding. In practice, a combination of sparse and dense methods is often applied.  

As one might expect, production quality search systems require a great deal more sophistication than is applied here. Further, we pursue only one of many possible architectures for this problem.

### Unsupervised vector text search

As you work through this notebook, you may notice that something is missing. There is no objective evaluation. Ideally, one would want to compare models using metrics like recall and precision to evaluate the search. In this case, we have no labeled cases, so we will proceed on a fully unsupervised basis.  
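
For reference, if labeled relevance judgments were available, precision and recall at a cutoff $k$ could be computed along the following lines. This is a minimal sketch: the function name `precision_recall_at_k`, the document ids, and the relevance labels are all hypothetical and are not part of this notebook's data.

```python
def precision_recall_at_k(retrieved, relevant, k):
    """Return (precision@k, recall@k) for one query.

    retrieved: ranked list of document ids returned by the search.
    relevant:  set of document ids judged relevant (hypothetical labels).
    """
    top_k = retrieved[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant)
    precision = hits / k if k > 0 else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical example: 5 ranked results, 4 documents labeled relevant.
retrieved_ids = [7, 2, 9, 4, 1]
relevant_ids = {2, 4, 5, 8}
p, r = precision_recall_at_k(retrieved_ids, relevant_ids, k=5)
print(f"precision@5 = {p:.2f}, recall@5 = {r:.2f}")
```

Precision measures how many of the returned results are relevant, while recall measures how many of the relevant documents were returned.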

## Setup

In this section, you will perform the setup steps required to execute the remainder of this notebook. This notebook is intended to run in Google Colab using a GPU. Access to a [Google Drive](https://workspace.google.com/products/drive/) account is also required. If you are not familiar with working in Colab you can find a [quick start guide here](https://docs.google.com/document/d/1afPjc4IaeZzIqUAX20uBEk3Dt41pAP0Ebkpd53EJTaE/edit?tab=t.0).   

To configure the Colab runtime environment, click `Change runtime type` under the Runtime menu tab. Select a GPU type for execution.   

### Package installation and imports

As a first step, install the required packages by executing the code in the cell below.  

> **Note:** There are multiple possible version conflicts which arise when creating an environment where FAISS runs. You may well see errors from the pip installer. If this occurs you need to click `Restart session` under the Runtime menu tab and execute the code in the cell below again.

In [None]:
# Quote the pyarrow requirement so the shell does not treat `<` as a redirect
!pip install -qU \
    sentence-transformers \
    torch==2.6.0 \
    pandas==2.2.2 \
    datasets==2.12.0 \
    "pyarrow<20.0.0a0" \
    numpy==2.2

!pip install faiss-gpu-cu12
!pip install pinecone-text

Once you have successfully installed the required packages, execute the code in the cell below to perform the imports. As verification of your runtime environment you should see two messages:  
1. An indication that you are running on CUDA, and therefore using GPU acceleration.    
2. Confirmation that your Google Drive has been mounted.    

In [None]:
import numpy as np
import pandas as pd
import faiss
from pinecone_text.sparse import BM25Encoder
import os
from sklearn.model_selection import train_test_split
from tqdm.auto import tqdm
import torch
import textwrap

from sentence_transformers import SentenceTransformer, CrossEncoder

device = 'cuda' if torch.cuda.is_available() else 'cpu'
print(f"running on {device}")

from google.colab import drive
drive.mount('/content/drive')

### Loading and exploring the data file   

Before proceeding you must do the following:   
1. Download the zipped data file from the [Wikipedia Movie Plots](https://www.kaggle.com/datasets/jrobischon/wikipedia-movie-plots) page.     
2. Unzip the file.
3. Place the unzipped file in your Google Drive file system. You can place this file at the top level or in a directory.     
4. In the code cell below edit the `file_path` variable to match the location and name of your unzipped file. For example, if your file is named `wiki_movie_plots.csv` and is in a directory called `CSCIS108`, the full path name is `"/content/drive/MyDrive/CSCIS108/wiki_movie_plots.csv"`.

With the foregoing steps completed, run the code in the cell below to load the file. You should see the 8 column names printed as a verification of the loading.

In [None]:
file_path = "/content/drive/MyDrive/YOUR PATH HERE"
movie_plots = pd.read_csv(file_path)
print(movie_plots.shape)
print(movie_plots.columns)

To get a feel for what these data look like, execute the code in the cell below to display the title and plot summary of a few movies.  

In [None]:
def print_movies(movies, line_length=100):
  if isinstance(movies, pd.Series):
    print(movies.loc['Title'])
    print(textwrap.fill(movies.loc['Plot'],  width=line_length))
    print('\n')
  else:
    # Iterate over index labels so this also works when the index is not 0..n-1
    for idx in movies.index:
      print(movies.loc[idx,'Title'])
      print(textwrap.fill(movies.loc[idx,'Plot'],  width=line_length))
      print('\n')

print_movies(movie_plots.iloc[:10,:])

Notice that the plot summaries are all in unstructured natural language format.   

> **Note:** In the interest of brevity, we are truncating the exploration process at this point. Examples of further exploration we would normally perform include:  
> 1. Using ordered bar charts, often on a log scale, to explore the frequency of words in the corpus and the queries.
> 2. Examining the distribution of simple Jaccard similarity between a sample of the documents and queries.   
> 3. Performing the foregoing explorations with the sample stratified by other attributes like genre, range of release year, etc.  
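
As an illustration of the second exploration listed above, a simple word-level Jaccard similarity can be sketched as follows. The whitespace tokenization, the function name `jaccard_similarity`, and the example strings are assumptions made for illustration. A real exploration would more likely use a proper tokenizer and stop-word handling.

```python
def jaccard_similarity(text_a, text_b):
    """Jaccard similarity between the sets of lowercase word tokens of two texts."""
    tokens_a = set(text_a.lower().split())
    tokens_b = set(text_b.lower().split())
    if not tokens_a and not tokens_b:
        return 0.0  # convention chosen here for two empty texts
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


# Hypothetical document and query strings.
doc = "a soldier returns home from the war"
query = "soldier returns from war"
print(f"Jaccard similarity: {jaccard_similarity(doc, query):.3f}")
```

Jaccard similarity only counts exact word overlap, which is one reason the remainder of this notebook relies on dense embeddings instead.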

## Vector Search   

You will now employ an approximate vector similarity search on the movie plots to perform searches for similar movies. Searches can be performed in two ways, both of which are components of **content-based recommender algorithms**.   
1. Finding movies with semantically similar plots to a current movie.    
2. Finding movies with semantically similar plots to a user query.    

The vector similarity search requires two steps.    
1. First, the natural language plot summaries are transformed to embedding vectors. To perform the similarity search, the natural language queries are transformed to the same embedding space.     
2. An efficient approximate nearest neighbor search index is then used to search the embedding vectors.
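
For comparison, the exact search that an approximate index tries to emulate can be sketched as a brute-force scan. The example below is only an illustration under stated assumptions: the 3-dimensional vectors are hypothetical stand-ins for real embeddings, all vectors are assumed to be nonzero, and `exact_top_k` is an illustrative helper rather than part of any library.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two nonzero vectors given as sequences of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def exact_top_k(query_vec, doc_vecs, k):
    """Score every document (O(n) work per query) and return the k most similar."""
    scored = [(cosine_similarity(query_vec, vec), i) for i, vec in enumerate(doc_vecs)]
    scored.sort(reverse=True)
    return [(i, score) for score, i in scored[:k]]


# Hypothetical 3-dimensional embeddings; real embeddings have hundreds of dimensions.
doc_vecs = [(1.0, 0.0, 0.0), (0.8, 0.6, 0.0), (0.0, 0.0, 1.0)]
query_vec = (1.0, 0.1, 0.0)
print(exact_top_k(query_vec, doc_vecs, k=2))
```

Because every document is scored for every query, this approach does not scale to large corpora, which motivates the approximate index.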

> **Exercise 5-01:** One might well ask, why not use an LLM to perform the document search? There are two significant reasons. One reason is the slow query speed of the LLM, which can lead to significant lag if a great number of queries must be processed. The other reason is cost. Consider a use case for a medium size enterprise application with the following parameters:
>   - 100 thousand shards of 500 tokens each are uploaded (updated) daily to the corpus.
>   - 10 million queries per day with average query length of 500 tokens.
>   - Generated results of 100 tokens per query.
>   - Cost to upload 1 million tokens is $0.10.
>   - Cost to generate 1 million tokens is $0.50.
> 1. What is the daily cost for this system?
> 2. What is the annual cost for 250 business days per year?
> 3. For a medium size enterprise with revenue of $50 million per year how significant is this cost?    

> **Answer:**
> 1.    
> 2.    
> 3.  

### Create embeddings    

The [SentenceTransformer model](https://sbert.net/docs/sentence_transformer/pretrained_models.html) from the Sentence Transformers package is a model with several possible sets of pretrained weights. In the interest of fast execution, but most likely lower accuracy, we will use a small pretrained weight set. Cosine similarity is used since the embeddings are being computed for natural language.

The sentence transformers model includes a tokenizer. The embedding is computed from the first $m$ tokens encountered. As a result, long text sequences are truncated. To overcome this limitation we could shard the document and create embeddings for each shard. In the interest of simplicity, we skip this process here.  

> **Note 1:** Given the length of many of these plot summaries, it is likely that the number of tokens required exceeds the limit of the model. To some extent, this problem can be overcome by **sharding** the long text sequences. We will not do so here in the interest of simplicity and brevity. 
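
To make the idea of sharding concrete, here is a minimal sketch that splits a long text into overlapping windows. It rests on several assumptions: whitespace-separated words stand in for the model's subword tokens, the `max_tokens` and `overlap` values are illustrative, and `shard_text` is a hypothetical helper. How the shard embeddings are combined per document afterwards is a separate design choice not covered here.

```python
def shard_text(text, max_tokens=256, overlap=32):
    """Split text into overlapping shards of at most max_tokens whitespace tokens."""
    if overlap >= max_tokens:
        raise ValueError("overlap must be smaller than max_tokens")
    tokens = text.split()
    step = max_tokens - overlap
    shards = []
    for start in range(0, len(tokens), step):
        shards.append(" ".join(tokens[start:start + max_tokens]))
        # Stop once the current window reaches the end of the text.
        if start + max_tokens >= len(tokens):
            break
    return shards


# Hypothetical 10-token text, split into windows of 4 tokens with 1 token of overlap.
example = " ".join(str(i) for i in range(10))
for shard in shard_text(example, max_tokens=4, overlap=1):
    print(shard)
```

The overlap keeps sentences that straddle a shard boundary at least partly visible in both neighboring shards.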

> **Note 2:** There are quite a few [**pretrained models**](https://sbert.net/docs/sentence_transformer/pretrained_models.html) available in the Sentence Transformers package. Finding the best model choice is a matter of considerable experimentation, which we skip here. We will use the `multi-qa-mpnet-base-dot-v1` model, which often provides good results for semantic similarity search problems. If you find that this large model executes too slowly or requires too much memory in your environment you can use the much smaller `all-MiniLM-L6-v2` model.

Execute the code in the cell below to download the weights and instantiate the model.   

In [None]:
#sentence_model = 'sentence-transformers/all-MiniLM-L6-v2'
sentence_model = 'sentence-transformers/multi-qa-mpnet-base-dot-v1'
model = SentenceTransformer(
    sentence_model,
    similarity_fn_name='cosine',
    device=device
)

To see the properties of this model, execute the code in the cell below.

In [None]:
model

> **Exercise 5-02:** Examine the properties of the embedding model and answer these questions:
> 1. What is the maximum number of tokens processed by the model's tokenizer?
> 2. What limitation will this place on the embeddings given the plot summaries you have viewed and how do you think this limit will affect the similarity search?
> 3. How would sharding the plot summaries help this problem? 
> 4. How long are the embedding vectors produced?

> **Answers:**
> 1.       
> 2.     
> 3.         
> 4.         

The code in the cell below performs two operations.
1. We split the dataset into train and query (test) sets.
2. The natural language plots are embedded using the sentence transformers model.

Execute the code in the cell below and notice the dimensions of the two embedding arrays.

In [None]:
np.random.seed(4242)
movie_plots_train, movie_plots_test = train_test_split(movie_plots, test_size=0.03)
%time embeddings_test = model.encode(movie_plots_test.loc[:,'Plot'].to_numpy())
%time embeddings_train = model.encode(movie_plots_train.loc[:,'Plot'].to_numpy())
print(f"Dimensions of train embeddings: {embeddings_train.shape}  Dimensions of test embeddings: {embeddings_test.shape}")

### Build the index   

The code in the cell below constructs a composite index we will use for the scalable similarity search. To ensure we are using a cosine metric we construct the composite index a step at a time in the `build_index` function. The index is built by these steps:    
1. The OPQ rotation (`OPQMatrix`) is defined.
2. A flat L2 index (`IndexFlatL2`) is defined to serve as the coarse quantizer.   
3. An inverted file index with product quantization (`IndexIVFPQ`) is defined on top of the coarse quantizer.   
4. The rotation and the IVF-PQ index are combined (`IndexPreTransform`), which gives the optimized PQ (OPQ) algorithm.
5. Finally, a flat refinement search (`IndexRefineFlat`) is defined.   

Once the index is instantiated, the vectors are L2 normalized, the index is trained and the vectors are added to the index.  

Execute the code in the cell below and examine the training time and size.  

In [None]:
def get_memory(index, digits=2):
    # write index to file
    faiss.write_index(index, './temp.index')
    # get file size
    file_size = os.path.getsize('./temp.index')
    # delete saved index
    os.remove('./temp.index')
    return round(file_size/1000000, digits)

def build_index(D, m, nbits, nlist):
  # Initialize OPQ and coarse+fine quantizer steps
  opq = faiss.OPQMatrix(D, m)
  vecs = faiss.IndexFlatL2(D)
  # Define the subindex
  sub_index = faiss.IndexIVFPQ(vecs, D, nlist, m, nbits)
  # Define the index
  index = faiss.IndexPreTransform(opq, sub_index)
  # Add all of the previous steps to final refinement step
  index = faiss.IndexRefineFlat(index)
  return index

# L2 normalize the training vectors so that L2 distance ranks neighbors the same way as cosine similarity
faiss.normalize_L2(embeddings_train)

# Build the index
D=embeddings_train.shape[1]
m = 64
nbits = 8
nlist = 4096
index_IVFL2_PQ = build_index(D, m, nbits, nlist)

# Train the index and add the vectors
%time index_IVFL2_PQ.train(embeddings_train)
index_IVFL2_PQ.add(embeddings_train)
print(f'Index size: {get_memory(index_IVFL2_PQ)} MB')
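
To get a rough feel for where the index size reported above comes from, the sketch below adds up the main components of an OPQ + IVF-PQ index with flat refinement. It is an approximation under stated assumptions: float32 vectors, 8-byte vector ids, and no container overhead. The corpus size `n` and dimension `D` in the example are hypothetical, and `estimate_index_bytes` is an illustrative helper, not a FAISS function.

```python
def estimate_index_bytes(n, D, m, nbits, nlist):
    """Rough per-component byte counts for OPQ + IVF-PQ + flat refinement.

    Assumes float32 (4-byte) values and 8-byte vector ids, and ignores
    container overhead, so a real serialized index will be somewhat larger.
    """
    return {
        "pq_codes": n * m * nbits // 8,        # compressed codes, m sub-quantizers
        "ids": n * 8,                          # one 64-bit id per vector
        "refine_flat": n * D * 4,              # uncompressed copies for refinement
        "coarse_centroids": nlist * D * 4,     # one centroid per inverted list
        "pq_codebooks": (2 ** nbits) * D * 4,  # 2^nbits centroids per sub-space
        "opq_rotation": D * D * 4,             # D x D rotation matrix
    }


# Hypothetical corpus size and dimension; m, nbits and nlist match the cell above.
sizes = estimate_index_bytes(n=10_000, D=128, m=64, nbits=8, nlist=4096)
for name, num_bytes in sizes.items():
    print(f"{name:>17}: {num_bytes / 1e6:8.3f} MB")
print(f"{'total':>17}: {sum(sizes.values()) / 1e6:8.3f} MB")
```

In this hypothetical setting the uncompressed refinement copies dominate, while the PQ codes themselves take only 64 bytes per vector.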

> **Exercise 5-03:** We are using the typical choice of **cosine similarity** between natural language embedding vectors. However, you will notice that L2 or dot product is used to specify the index. What step in the computation of the similarity search transforms the metric to cosine similarity and why?   

> **Answer:**    

### Similarity search between movies

As a first exercise of the vector search index we will find the movies with the highest semantic similarity to one movie in the test dataset. Execute the code in the cell below to display the title and plot of this movie.  

In [None]:
print_movies(movie_plots_test.iloc[0,:])

Now we are ready to perform the query by the following steps.   
1. The query embedding is L2 normalized.     
2. Hyperparameters of $k=100$ and $nprobe=128$ are set.  
3. The search is executed.     

Execute the code.

In [None]:
faiss.normalize_L2(embeddings_test)

k=100
index_IVFL2_index = faiss.extract_index_ivf(index_IVFL2_PQ)
index_IVFL2_index.nprobe = 128
%time dist, indx = index_IVFL2_PQ.search(embeddings_test[0,:].reshape(1,D), k=k)
print(f"Distance to results: {dist}")
print(f"Index to results: {indx}")
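
To make the role of `nprobe` in the cell above more concrete, here is a toy inverted-file search in pure Python. It is a sketch under stated assumptions: the 2-dimensional vectors are hypothetical, the two centroids are picked by hand (FAISS learns them during `train`), and the helper names are illustrative. No rotation or product quantization is modeled.

```python
def sq_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def build_inverted_lists(vectors, centroids):
    """Assign each vector id to the inverted list of its nearest centroid."""
    lists = {c: [] for c in range(len(centroids))}
    for vec_id, vec in enumerate(vectors):
        nearest = min(range(len(centroids)), key=lambda c: sq_l2(vec, centroids[c]))
        lists[nearest].append(vec_id)
    return lists


def ivf_search(query, vectors, centroids, lists, k, nprobe):
    """Scan only the nprobe inverted lists whose centroids are closest to the query."""
    probe = sorted(range(len(centroids)), key=lambda c: sq_l2(query, centroids[c]))[:nprobe]
    candidates = [vec_id for c in probe for vec_id in lists[c]]
    ranked = sorted(candidates, key=lambda i: sq_l2(query, vectors[i]))
    return [(i, sq_l2(query, vectors[i])) for i in ranked[:k]]


# Hypothetical data: five 2-d vectors and two hand-picked centroids.
vectors = [(0.0, 0.1), (0.2, 0.0), (1.0, 1.1), (0.9, 1.0), (0.45, 0.6)]
centroids = [(0.0, 0.0), (1.0, 1.0)]
lists = build_inverted_lists(vectors, centroids)
query = (0.4, 0.4)

for nprobe in (1, 2):
    ids = [i for i, _ in ivf_search(query, vectors, centroids, lists, k=2, nprobe=nprobe)]
    print(f"nprobe={nprobe}: nearest ids {ids}")
```

With a single probed list the true nearest neighbor (id 4) is missed because it was assigned to the other list. Probing both lists recovers it at the cost of scanning more vectors, which is the speed versus recall trade-off that `nprobe` controls.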

> **Exercise 5-04:** You can see the nearest neighbor distances of the plots in ascending order, along with the indexes of the vectors.
> 1. What does the rapidly increasing distance tell you about the relationship of the semantic similarity of the results of this query?    
> 2. Consider the role of the indices. How does the index help identify the movies to return as the results of the query?       

> **Answers:**
> 1.     
> 2.      

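The code in the cell below displays the movie title, the distance from the query embedding vector to the movie plot embedding vector, and the movie plot text. Because the vectors are L2 normalized, these L2-based distances rank the results in the same order as cosine distance. Execute this code and examine the results.  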

In [None]:
def print_movies_distances(movies, line_length=100):
    for idx in movies.index:
      print("Title: " + movies.loc[idx,'Title'] + "     Distance: " + str(movies.loc[idx,'Distance']))
      print(textwrap.fill(movies.loc[idx,'Plot'],  width=line_length))
      print('\n')


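# FAISS returns row positions in embeddings_train, so look up rows of
# movie_plots_train by position rather than rows of movie_plots by label
results = movie_plots_train.iloc[indx[0]][['Title', 'Plot']].copy()
results['Distance'] = dist[0]
print_movies_distances(results.iloc[:10,:])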

We have found the 10 movies that have the smallest cosine distance to the query. It is clearly subjective if these results represent similar movies from a human perspective.    

## Search with User Query

As another experiment with vector search, we will try a short natural language query. Users of a streaming service often create such queries when searching for content.   

The code in the cell below defines a short natural language query and performs a vector search on the embedding of this query. Execute this code and examine the results.

In [None]:
Sample_query = np.array(["Film with soldiers dealing with the transition from war to peace"])
Sample_query_embedded = model.encode(Sample_query)
faiss.normalize_L2(Sample_query_embedded)

index_IVFL2_index.nprobe = 128
%time dist, idx_query = index_IVFL2_PQ.search(Sample_query_embedded.reshape(1,D), k)

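# FAISS returns row positions in embeddings_train, so index movie_plots_train by position
results2 = movie_plots_train.iloc[idx_query[0]][['Title', 'Plot']].copy()
results2['Distance'] = dist[0]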
print_movies_distances(results2)

As with the first query experiment, the evaluation of these results is quite subjective. From a human perspective it is hard to determine if these movies are a good match to the query.  

## Reranking Transformers   

As one would expect, approximate nearest neighbor search (ANNS) will generally not produce the best search results. Therefore, a reranking step is often applied to find a shorter list of the best results from a long list produced by ANNS.   

There are several possible ways to construct a search and reranking pipeline to produce improved search results. You can find a summary of reranking at scale oriented toward RAG algorithms in [this post from Pinecone](https://www.pinecone.io/learn/series/rag/rerankers/). In summary, two of several possible approaches are:    
1. A sparse vector search is followed by a dense vector search. The sparse search acts as a filter on key words in the query. The dense vector search refines these results. While this approach can yield results with high precision and recall, the sparse search can be slow at massive scale.  
2. A scalable, ideally $O(\log n)$, dense ANNS search is followed by a reranking of a shortened list using a pairwise comparison search. The $O(n)$ complexity of the pairwise search is therefore limited to the top $k$ results of the ANNS search. We take this approach here.  
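
The second approach can be sketched generically as follows. This is a toy illustration under stated assumptions: a linear scan with a cheap word-overlap score stands in for the ANNS stage, a Jaccard score stands in for the expensive pairwise model, and the corpus, query, and function names are all hypothetical.

```python
def retrieve_then_rerank(query, corpus, cheap_score, expensive_score, k, top_n):
    """Two-stage search: cheap scoring over the corpus, expensive scoring over k candidates."""
    # Stage 1: keep the k best candidates under the cheap score.
    candidates = sorted(corpus, key=lambda doc: cheap_score(query, doc), reverse=True)[:k]
    # Stage 2: rerank only those k candidates with the expensive score.
    reranked = sorted(candidates, key=lambda doc: expensive_score(query, doc), reverse=True)
    return reranked[:top_n]


expensive_calls = []  # records which documents reach the expensive stage


def overlap_count(query, doc):
    """Cheap stand-in for the first-stage score: number of shared words."""
    return len(set(query.split()) & set(doc.split()))


def jaccard(query, doc):
    """Stand-in for the expensive pairwise model; logs each call."""
    expensive_calls.append(doc)
    q, d = set(query.split()), set(doc.split())
    return len(q & d) / len(q | d)


# Hypothetical corpus and query.
corpus = [
    "soldiers return home from war",
    "a war film about soldiers and generals and battles and strategy",
    "a romantic comedy set in paris",
    "soldiers adjust to peace after war",
]
best = retrieve_then_rerank("soldiers war peace", corpus, overlap_count, jaccard, k=3, top_n=2)
print(best)
print(f"expensive model calls: {len(expensive_calls)} of {len(corpus)} documents")
```

The expensive scorer runs only on the $k$ candidates that survive the first stage, so its cost no longer grows with the size of the corpus.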



Specifically in this case, we will use the [CrossEncoder](https://sbert.net/examples/cross_encoder/applications/README.html) from the Sentence Transformers package to compute pairwise similarity of our search results. To instantiate this model, execute the code in the cell below.

> **Note:** The pretrained model used in this case is only one of many [pretrained models](https://sbert.net/docs/cross_encoder/pretrained_models.html) available for CrossEncoders.

In [None]:
# 1. Load a pre-trained CrossEncoder model
pretrained_model = "cross-encoder/ms-marco-MiniLM-L6-v2"
model_CrossEncoder = CrossEncoder(pretrained_model)

The `compute_scores` function defined below iterates over the target movie plots and uses the cross-encoder to score each (query, plot) pair. Note that the cross-encoder takes the raw text of each pair as input, not the embedding vectors. Execute this code to compute the pairwise scores.    

In [None]:
def compute_scores(model, query, targets):
  scores = []
#  print(query)
  for string in targets:
    scores.append(model.predict([(query, string)])[0])
  return scores

query_index = movie_plots_test.index[0]
scores = compute_scores(model_CrossEncoder, movie_plots_test.loc[query_index,'Plot'], results.loc[:,'Plot'])
scores[:10]

As a final step, the code in the cell below adds the scores column to the data frame and sorts it in descending order of score to produce and display the top 10 reranked matches. Execute this code and examine the results.

In [None]:
# Sort the results data frame by descending score
results['Score'] = scores
results.sort_values(by='Score', ascending=False, inplace=True)
results.iloc[:10,:]

> **Exercise 5-05:** Examine these results and answer the following questions:
> 1. Why is the cross-encoder a good choice for reranking, but not suitable for the initial retrieval step?
> 2. What quantitative evidence can you see in the table above of the effect of reranking?   

> **Answers:**
> 1.       
> 2.         

> **Exercise 5-06:** The methods used in this notebook only scratch the surface of possibilities for improving information retrieval. Consider the possibilities and answer these questions:
> 1. One way to improve search results is to filter the raw database by one or more categorical attributes. For example, in the case of movies the genre might be one way to do so. If you were to implement such a filter, what type of index or data structure would you use for efficient filtering at scale? What is the computational complexity of your approach?
> 2. Another way one could improve search on a movie database is to search categorical tags. For example, similarity search on the actors in a movie can be used. However, as with many text-coded categorical variables, the search process is not as simple as it sounds at first. Explain how a multi-hash algorithm can be used for this type of situation. Can you use the same type of data structure discussed in your answer to the first question for this situation, and why?        

> **Answer:**
> 1.        
> 2.       

#### Copyright 2025, Stephen F. Elston. All rights reserved.