# 5 - Information Retrieval
## CSCI E-108
### Steve Elston

## Introduction

Document search is a core data mining methodology. For example, document search is an essential component of solutions like web search, recommender systems and RAG algorithms for constraining results from large language models.  

This notebook contains a simple example of vector document search. The search approach used in this notebook is deliberately restricted. A real-world solution would be considerably more complex. In this case, the search is performed in two simple stages.    
1. An approximate vector search using a scalable index.
2. A reranking or refinement step using a more exact, but slower algorithm, applied to the results of the first search.   

These steps can both be accomplished with dense text embedding. In other cases a hybrid of sparse and dense embedding methods is applied.  
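
The two stages can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not the FAISS pipeline used later in this notebook: the "approximate" stage simply scores only the first few dimensions of each vector, the random vectors stand in for real embeddings, and the names `coarse_search`, `rerank` and `n_candidates` are hypothetical.

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def coarse_search(query, corpus, n_candidates, n_dims=4):
    """Stage 1: cheap, approximate scoring using only the first n_dims dimensions."""
    scores = [(dot(query[:n_dims], doc[:n_dims]), i) for i, doc in enumerate(corpus)]
    scores.sort(reverse=True)
    return [i for _, i in scores[:n_candidates]]

def rerank(query, corpus, candidates, top_k):
    """Stage 2: exact cosine similarity, computed only for the candidates."""
    scored = [(dot(query, corpus[i]), i) for i in candidates]
    scored.sort(reverse=True)
    return scored[:top_k]

# Random unit vectors stand in for document and query embeddings
random.seed(0)
corpus = [normalize([random.gauss(0, 1) for _ in range(16)]) for _ in range(200)]
query = normalize([random.gauss(0, 1) for _ in range(16)])

candidates = coarse_search(query, corpus, n_candidates=50)
top = rerank(query, corpus, candidates, top_k=5)
for score, i in top:
    print(f"doc {i}: cosine similarity {score:.3f}")
```

The expensive exact scoring touches only the 50 candidates rather than all 200 documents. The price is that a true nearest neighbor missed by the first stage can never be recovered by the second.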

As one might expect, production quality search systems require a great deal more sophistication than is applied here. Further, we pursue only one of many possible architectures for this problem.

The approach we use for retrieval here is truly unsupervised. We have no known rankings. For these situations, evaluation of retrieval solutions presents a significant problem. Indeed, we have no objective basis for evaluation. Even if we went to the time and expense of creating a limited set of human rankings, the result would likely be biased in some way. An alternative is to compare these results to other approaches. For example, one could compare rankings to those produced by one or more LLMs. Even here, there is no way to know how good or biased the evaluation performed is. In the interest of simplicity and limited effort, we will not attempt any objective evaluation here. Rather, we are limited to some subjective evaluation.       

### Unsupervised vector text search

As you work through this notebook, you may notice that something is missing. There is no objective evaluation. Ideally, one would want to compare models using metrics like recall and precision to evaluate the search. In this case, we have no labeled cases, so we will proceed on a fully unsupervised basis.  

## Setup

In this section, you will perform the setup steps required to execute the remainder of this notebook. This notebook is intended to run in Google Colab using a GPU. Access to a [Google Drive](https://workspace.google.com/products/drive/) account is also required. If you are not familiar with working in Colab you can find a [quick start guide here](https://docs.google.com/document/d/1afPjc4IaeZzIqUAX20uBEk3Dt41pAP0Ebkpd53EJTaE/edit?tab=t.0).   

To configure the Colab runtime environment, click `Change runtime type` under the Runtime menu tab. Select a GPU type for execution.   

### Package installation and imports

As a first step, install the required packages by executing the code in the cell below.  

> **Note 1:** Execution of this notebook using A100 GPU on Google Colab Pro+ requires about 4 hours of GPU time and 1 hour of wall-clock time.  
>
> **Note 2:** There are multiple possible version conflicts which arise when creating an environment where FAISS runs. You may well see errors from the pip installer. If this occurs you need to click `Restart session` under the Runtime menu tab and execute the code in the cell below again.

In [None]:
!pip install -qU \
    sentence-transformers \
    torch==2.6.0 \
    pandas==2.2.2 \
    datasets==2.12.0 \
    "pyarrow<20.0.0a0" \
    numpy==2.2

!pip install rank-bm25
!pip install faiss-gpu-cu12
!pip install pinecone-text

Once you have successfully installed the required packages, execute the code in the cell below to perform the imports. As verification of your runtime environment you should see two messages:  
1. An indication that you are running on CUDA, and therefore using GPU acceleration.    
2. Confirmation that your Google Drive has been mounted.    

In [None]:
import numpy as np
import pandas as pd
import faiss
from rank_bm25 import BM25Okapi

import os
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from tqdm.auto import tqdm
import torch
import textwrap
import matplotlib.pyplot as plt


from sentence_transformers import SentenceTransformer, CrossEncoder

import re
import string
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

device = 'cuda' if torch.cuda.is_available() else 'cpu'
print(f"running on {device}")

from google.colab import drive
drive.mount('/content/drive')

### Loading and exploring the data file   

Before proceeding you must do the following:   
1. Download the zipped data file from the [Wikipedia Movie Plots](https://www.kaggle.com/datasets/jrobischon/wikipedia-movie-plots) page.     
2. Unzip the file.
3. Place the unzipped file in your Google Drive file system. You can place this file at the top level or in a directory.     
4. In the code cell below edit the `file_path` variable to match the location and name of your unzipped file. For example, if your file is named `wiki_movie_plots.csv` and is in a directory called `CSCIS108`, the full path name is `"/content/drive/MyDrive/CSCIS108/wiki_movie_plots.csv"`.

With the foregoing steps completed, run the code in the cell below to load the file. You should see the 8 column names printed as a verification of the loading.

In [None]:
file_path = "/content/drive/MyDrive/Colab Notebooks/wiki_movie_plots.csv"
movie_plots = pd.read_csv(file_path)
print(movie_plots.shape)
print(movie_plots.columns)

To get a feel for what these data look like, execute the code in the cell below to display the title and plot summary of a few movies.  

In [None]:
def print_movies(movies, line_length=100):
  if isinstance(movies, pd.Series):
    print(movies.loc['Title'])
    print(textwrap.fill(movies.loc['Plot'],  width=line_length))
    print('\n')
  else:
    for i in range(len(movies)):
      print(movies.loc[i,'Title'])
      print(textwrap.fill(movies.loc[i,'Plot'],  width=line_length))
      print('\n')

print_movies(movie_plots.iloc[:10,:])

Notice that the plot summaries are all in unstructured natural language format.   

> **Note:** In the interest of brevity, we are truncating the exploration process at this point. Examples of further exploration we would normally perform include:  
> 1. Using ordered bar charts, often on a log scale, to explore the frequency of words in the corpus and the queries.
> 2. Examining the distribution of simple Jaccard similarity between a sample of the documents and queries.   
> 3. Performing the foregoing explorations with the sample stratified by other attributes like genre, range of release year, etc.  
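
As an illustration of the second exploration above, Jaccard similarity between two texts can be computed from their token sets. This is a minimal sketch that assumes simple whitespace tokenization; the function name `jaccard_similarity` and the example strings are hypothetical.

```python
def jaccard_similarity(text_a, text_b):
    """Jaccard similarity between the sets of lower-cased whitespace tokens."""
    tokens_a = set(text_a.lower().split())
    tokens_b = set(text_b.lower().split())
    if not tokens_a and not tokens_b:
        return 0.0
    # Size of the intersection divided by the size of the union
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)

doc = "a soldier returns home from the war"
query = "soldier returns from war"
print(jaccard_similarity(doc, query))
```

Because the measure only counts shared tokens, it is sensitive to exact wording and ignores semantics.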

## Preprocessing the Plot Text     

While not our focus here, we will follow the basic common practice of preparing text data for natural language processing (NLP). These steps are:       
1. All characters are set to lower case, so that the same word with different capitalization is not treated as two different words.   
2. Punctuation is removed.   
3. Numbers are removed since they generally do not have semantic meaning.     
4. Commonly used words with limited semantic value, such as conjunctions, known as [**stop words**](https://en.wikipedia.org/wiki/Stop_word), are removed.    

In a production scale system these processes can be parallelized quite readily. Execute this code and examine the results.   

In [None]:
def preprocess_text(text_list):
    """
    Preprocesses a list of text strings for use with Sentence Transformers.

    Args:
        text_list (list): A list of strings, where each string is a sentence or text segment.

    Returns:
        list: A list of preprocessed strings.
    """
    preprocessed_texts = []
    stop_words = set(stopwords.words('english'))

    for text in text_list:
        # Convert all characters to lowercase
        text = text.lower()

        # Remove punctuation
        text = text.translate(str.maketrans('', '', string.punctuation))

        # Remove numbers
        text = re.sub(r'\d+', '', text)

        # Tokenization to words and then stop word removal
        words = word_tokenize(text)
        words = [word for word in words if word not in stop_words]
        preprocessed_text = ' '.join(words)

        # Remove extra whitespace
        preprocessed_text = ' '.join(preprocessed_text.split())

        preprocessed_texts.append(preprocessed_text)

    return preprocessed_texts


## Make sure we have the nltk stop words downloaded
try:
    stopwords.words('english')
except LookupError:
    nltk.download('stopwords')

## Download the nltk punkt tokenizer
#import nltk
nltk.download('punkt_tab')

%time movie_plots.loc[:,'Plot'] = preprocess_text(movie_plots.loc[:,'Plot'])
print_movies(movie_plots.loc[:3,:])

## Vector Search   

You will now employ an approximate vector similarity search on the movie plots to perform searches for similar movies. Searches can be performed in two ways which are a component of **content-based recommender algorithms**.   
1. Finding movies with semantically similar plots to a current movie.    
2. Finding movies with semantically similar plots to a user query.    

The vector similarity search requires two steps.    
1. First, the natural language plot summaries are transformed to embedding vectors. To perform the similarity search, the natural language queries are transformed to the same embedding space.     
2. An efficient approximate nearest neighbor search index is then applied to search the embedding vectors.

> **Exercise 5-01:** One might well ask, why not use an LLM to perform the document search? There are two significant reasons. One reason is the slow query speed of the LLM, which can lead to significant lag if a great number of queries must be processed. The other reason is cost. Consider a use case for a small scale business application with the following parameters:
>   - $10^6$ shards of 500 tokens each are uploaded (updated) daily to the corpus.
>   - $10^6$ queries per day with average query length of 500 tokens.
>   - Generated results of 500 tokens per query.
>   - Cost to upload 1 million tokens is $0.10.
>   - Cost to generate 1 million tokens is $1.00.
> 1. What is the daily cost for this system?
> 2. What is the annual cost for 250 business days per year?
> 3. For a medium size business unit with revenue of $10 million per year how significant is this cost?    

> **Answer:**
>   
> 1. Daily cost =     
> 2. Annual cost =     
> 3.           

### Create embeddings    

The [SentenceTransformers model](https://sbert.net/docs/sentence_transformer/pretrained_models.html) from the Sentence Transformers package is a model with several possible sets of pretrained weights. In the interest of fast execution, but most likely lower accuracy, we will use a small pretrained weight set. Cosine similarity is used since the embeddings are being computed for natural language.

The sentence transformers model includes a tokenizer. The embedding is computed from the first $m$ tokens encountered. As a result, long text sequences are truncated. To overcome this limitation we could shard the document and create embeddings for each shard. In the interest of simplicity, we skip this process here.  
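
A minimal sketch of sharding, which this notebook does not apply, is shown below. It assumes whitespace-separated words as a stand-in for the model's own tokens, and the function name `shard_tokens` and its `overlap` parameter are hypothetical. Each shard would then be embedded separately.

```python
def shard_tokens(tokens, shard_size, overlap=0):
    """Split a token list into consecutive shards of at most shard_size tokens.

    Adjacent shards share `overlap` tokens so that context spanning a shard
    boundary is not lost entirely.
    """
    if shard_size <= 0 or not 0 <= overlap < shard_size:
        raise ValueError("need shard_size > 0 and 0 <= overlap < shard_size")
    step = shard_size - overlap
    shards = []
    for start in range(0, len(tokens), step):
        shards.append(tokens[start:start + shard_size])
        # Stop once the current shard reaches the end of the text
        if start + shard_size >= len(tokens):
            break
    return shards

plot = "one two three four five six seven eight nine ten".split()
for shard in shard_tokens(plot, shard_size=4, overlap=1):
    print(shard)
```

With sharding, a search returns shards rather than whole documents, so the results must be mapped back to their parent documents, for example by keeping the best-scoring shard per document.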

> **Note 1:** Given the length of many of these plot summaries, it is likely that the number of tokens required exceeds the limit of the model. To some extent, this problem can be overcome by **sharding** the long text sequences. We will not do so here in the interest of simplicity and brevity.

> **Note 2:** There are quite a few [**pretrained models**](https://sbert.net/docs/sentence_transformer/pretrained_models.html) available in the SentenceTransformers package. Finding the best model choice is a matter of considerable experimentation, which we skip here. We will use the `multi-qa-mpnet-base-dot-v1` model, which often provides good results for semantic similarity search problems. If you find that this large model executes too slowly or requires too much memory in your environment you can use the much smaller `all-MiniLM-L6-v2` model.

Execute the code in the cell below to download the weights and instantiate the model.   

In [None]:
#sentence_model = 'sentence-transformers/all-MiniLM-L6-v2'
sentence_model = 'sentence-transformers/multi-qa-mpnet-base-dot-v1'
model = SentenceTransformer(
    sentence_model,
    similarity_fn_name='cosine',
    device=device
)

To see the properties of this model, execute the code in the cell below.

In [None]:
model

> **Exercise 5-02:** Examine the properties of the embedding model and answer these questions:
> 1. What is the maximum number of tokens processed by the model's tokenizer?
> 2. What limitation will this place on the embeddings given the plot summaries you have viewed and how do you think this limit will affect the similarity search?
> 3. How would sharding the plot summaries help this problem?  
> 4. How long are the embedding vectors produced?

> **Answers:**
> 1.        
> 2.         
> 3.       
> 4.           

The code in the cell below performs two operations.
1. We split the dataset into train and query (test) sets.
2. The natural language plots are embedded using the sentence transformers model.

Execute the code in the cell below and notice the dimensions of the two embedding arrays.

In [None]:
np.random.seed(4242)
movie_plots_train, movie_plots_test = train_test_split(movie_plots, test_size=0.03)
movie_plots_train.reset_index(inplace=True)
movie_plots_test.reset_index(inplace=True)
%time embeddings_test = model.encode(movie_plots_test.loc[:,'Plot'].to_numpy())
%time embeddings_train = model.encode(movie_plots_train.loc[:,'Plot'].to_numpy())
print(f"Dimensions of train embeddings: {embeddings_train.shape}  Dimensions of test embeddings: {embeddings_test.shape}")

In [None]:
print(movie_plots_train.head())
print("\n\n\n\n")
print(movie_plots_test.head())

### Build the index   

The code in the cell below constructs a composite index we will use for the scalable similarity search. To ensure we are using a cosine metric we construct the composite index a step at a time in the `build_index` function. The index is built by these steps:    
1. An OPQ rotation of the embedding vectors is defined.
2. A flat L2 index is defined to serve as the coarse quantizer.   
3. An inverted file (IVF) index with product quantization (PQ), which uses the flat L2 index as its coarse quantizer, is defined.   
4. The OPQ rotation and the IVF-PQ index are chained together, so that vectors are rotated before they are quantized.
5. Finally, a flat refinement search, which reranks the candidates using the uncompressed vectors, is defined.   

Once the index is instantiated, the vectors are L2 normalized, the index is trained and the vectors are added to the index.  

Execute the code in the cell below and examine the training time and size.  

In [None]:
def get_memory(index, digits=2):
    # write index to file
    faiss.write_index(index, './temp.index')
    # get file size
    file_size = os.path.getsize('./temp.index')
    # delete saved index
    os.remove('./temp.index')
    return round(file_size/1000000, digits)

def build_index(D, m, nbits, nlist):
  # Initialize OPQ and coarse+fine quantizer steps
  opq = faiss.OPQMatrix(D, m)
  vecs = faiss.IndexFlatL2(D)
  # Define the subindex
  sub_index = faiss.IndexIVFPQ(vecs, D, nlist, m, nbits)
  # Define the index
  index = faiss.IndexPreTransform(opq, sub_index)
  # Add all of the previous steps to final refinement step
  index = faiss.IndexRefineFlat(index)
  return index

# L2 normalize the training vectors to get cosine distance from L2 dot product distance
faiss.normalize_L2(embeddings_train)

# Build the index
D=embeddings_train.shape[1]
m = 64
nbits = 8
nlist = 4096
index_IVFL2_PQ = build_index(D, m, nbits, nlist)

# Train the index and add the vectors
%time index_IVFL2_PQ.train(embeddings_train)
index_IVFL2_PQ.add(embeddings_train)
print(f'Index size: {get_memory(index_IVFL2_PQ)} MB')

> **Exercise 5-03:** We are using the typical choice of **cosine similarity** between natural language embedding vectors. However, you will notice that L2 or dot product is used to specify the index. What step in the computation of the similarity search transforms the metric to cosine similarity and why?   

> **Answer:**         

### Similarity search between movies

As a first exercise of the vector search index we will find the movies with the highest semantic similarity to one movie in the test dataset. Execute the code in the cell below to display the title and plot of this movie.  

In [None]:
print_movies(movie_plots_test.iloc[0,:])

Now we are ready to perform the query by the following steps.   
1. The query embedding is L2 normalized.     
2. Hyperparameters of $k=100$ and $nprobe=128$ are set.  
3. The search is executed.     
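
To build intuition for the `nprobe` hyperparameter, the sketch below implements a toy inverted file (IVF) search in plain Python. It is not the FAISS implementation. As a simplifying assumption, the first `nlist` vectors stand in for trained centroids, and the names `build_ivf` and `ivf_search` are hypothetical.

```python
import random

def sq_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def build_ivf(vectors, nlist):
    """Assign every vector to its nearest centroid (one inverted list per centroid).

    Assumption: the first nlist vectors stand in for trained centroids; a real
    index would learn the centroids with k-means.
    """
    centroids = vectors[:nlist]
    lists = [[] for _ in range(nlist)]
    for i, v in enumerate(vectors):
        nearest = min(range(nlist), key=lambda c: sq_l2(v, centroids[c]))
        lists[nearest].append(i)
    return centroids, lists

def ivf_search(query, vectors, centroids, lists, k, nprobe):
    """Scan only the nprobe inverted lists whose centroids are closest to the query."""
    order = sorted(range(len(centroids)), key=lambda c: sq_l2(query, centroids[c]))
    candidates = [i for c in order[:nprobe] for i in lists[c]]
    scored = sorted((sq_l2(query, vectors[i]), i) for i in candidates)
    return scored[:k]

random.seed(1)
vectors = [[random.random() for _ in range(8)] for _ in range(500)]
query = [random.random() for _ in range(8)]
centroids, lists = build_ivf(vectors, nlist=16)

# Exhaustive search gives the true nearest neighbors for comparison
exact = sorted((sq_l2(query, v), i) for i, v in enumerate(vectors))[:5]
for nprobe in (1, 4, 16):
    found = ivf_search(query, vectors, centroids, lists, k=5, nprobe=nprobe)
    hits = len({i for _, i in found} & {i for _, i in exact})
    print(f"nprobe={nprobe:2d}: {hits} of 5 true nearest neighbors found")
```

Raising `nprobe` scans more lists, which improves recall at the cost of query time. When `nprobe` equals `nlist`, every vector is scanned and the search becomes exhaustive.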

Execute the code.

In [None]:
def plot_dists(dists, ylabel='Distance', title='Distance vs. rank'):
  ranks = range(len(dists))
  ranks = [x+1 for x in ranks]
  plt.scatter(x=ranks, y=list(dists));
  plt.xlabel('Rank');
  plt.ylabel(ylabel);
  plt.title(title);

faiss.normalize_L2(embeddings_test)

k=100
index_IVFL2_index = faiss.extract_index_ivf(index_IVFL2_PQ)
index_IVFL2_index.nprobe = 128
%time dist, indx = index_IVFL2_PQ.search(embeddings_test[0,:].reshape(1,D), k=k)
dense_distance = pd.Series(dist[0], index=indx[0])
#print("\nIndex and distances")
#print(dense_distance)

plot_dists(dense_distance)

> **Exercise 5-04:** You can see the nearest neighbor distances of the plot texts in ascending order, along with the indexes of the vectors.
> 1. What does the shape of the distance curve, including limited range of values, tell you about the relationship of the semantic similarity of the results of this query?    
> 2. Consider the role of the indices. How does the index help identify the movies to return as the results of the query?       

> **Answers:**
> 1.        
> 2.       

The code in the cell below displays the movie title, the cosine distance from the query embedding vector to the movie plot embedding vector, and the movie plot text. Execute this code and examine the results.  

In [None]:
def print_movies_distances(movies, line_length=100):
    for idx in movies.index:
      print("Title: " + movies.loc[idx,'Title'] + "\nDistance: " + str(movies.loc[idx,'Distance']))
      print(textwrap.fill(movies.loc[idx,'Plot'],  width=line_length))
      print('\n')


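# The index was built from the training embeddings, so look up rows in movie_plots_train
results = movie_plots_train.loc[indx[0], ['Title', 'Plot']].copy()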
results['Distance'] = dist[0]
print_movies_distances(results.iloc[:10,:])

We have found the 10 movies that have the smallest cosine distance to the query. It is clearly subjective if these results represent similar movies from a human perspective.    

## Search with User Query

As another experiment with vector search, we will try a short natural language query. Users of a streaming service often create such queries when searching for content.   

The code in the cell below defines a short natural language query and performs a vector search on the embedding of this query. Execute this code and examine the results.

In [None]:
Sample_query = np.array(["Film with soldiers dealing with the transition from war to peace"])
Sample_query_embedded = model.encode(Sample_query)
faiss.normalize_L2(Sample_query_embedded)

index_IVFL2_index.nprobe = 128
k=10
%time dist, idx_query = index_IVFL2_PQ.search(Sample_query_embedded.reshape(1,D), k)

results2 = movie_plots_train.loc[idx_query[0], ['Title', 'Plot']]
results2['Distance'] = dist[0]
print_movies_distances(results2)

As with the first query experiment, the evaluation of these results is quite subjective. From a human perspective it is hard to determine if these movies are a good match to the query.  

## Reranking Transformers   

As one would expect, approximate nearest neighbor search (ANNS) will generally not produce the best search results. Therefore, a reranking step is often applied to find a shorter list of the best results from the long list produced by ANNS.   

There are several possible ways to construct a search and reranking pipeline to produce improved search results. You can find a summary of reranking at scale oriented toward RAG algorithms in [this post from Pinecone](https://www.pinecone.io/learn/series/rag/rerankers/). In summary, two of several possible approaches are:    
1. A sparse vector search is followed by a dense vector search. The sparse search acts as a filter on key words in the query. The dense vector search refines these results. While this approach can yield results with high precision and recall, the sparse search can be slow at massive scale.  
2. A scalable, ideally $O(\log n)$, dense ANNS search is followed by a reranking of a shortened list using a pairwise comparison search. The $O(n)$ complexity of the pairwise search is therefore limited to the top $k$ results of the ANNS search. We take this approach here.  



Specifically in this case, we will use the [CrossEncoder](https://sbert.net/examples/cross_encoder/applications/README.html) from the Sentence Transformers package to compute pairwise similarity of our search results. To instantiate this model, execute the code in the cell below.

> **Note:** The pretrained model used in this case is only one of many [pretrained models](https://sbert.net/docs/cross_encoder/pretrained_models.html) available for CrossEncoders.

In [None]:
pretrained_model = "cross-encoder/ms-marco-MiniLM-L6-v2"
model_CrossEncoder = CrossEncoder(pretrained_model)

The `compute_scores` function defined below iterates over the target movie plots and uses the cross-encoder to score each (query, plot) text pair. Execute this code to compute the pairwise scores.    

In [None]:
def compute_scores(model, query, targets):
  scores = []
  for string in targets:
    scores.append(model.predict([(query, string)])[0])
  return scores

query_index = movie_plots_test.index[0]
scores = compute_scores(model_CrossEncoder, movie_plots_test.loc[query_index,'Plot'], results.loc[:,'Plot'])
scores[:10]

As a final step, the code in the cell below adds the scores column to the data frame and performs a sort in descending order of score to produce and display the top 10 reranked matches. Execute this code and examine the results.

In [None]:
## Sort the results data frame by descending score
results['Score'] = scores
results.sort_values(by='Score', ascending=False, inplace=True)
results.iloc[:10,:]

> **Exercise 5-05:** Examine these results and answer the following questions:
> 1. Why is the cross encoder a good choice for reranking, but not suitable for the initial retrieval step?
> 2. What quantitative evidence can you see in the table above of the effect of reranking?   

> **Answers:**
> 1.             
> 2.         

## Sparse Coding and Hybrid Coding     

So far we have only used dense embedding using SentenceTransformers, which provides semantic similarity scores. However, dense encoding tends to have low sensitivity to specific terms in the query. Sparse embedding is another powerful approach which computes scores sensitive to specific terms. But, sparse embedding does not account for semantics.  

It is possible to combine the strengths of both dense and sparse embedding. In the following we will create a hybrid indexer by using FAISS with dense embedding and BM25 for sparse embedding. For this hybrid recommender, the two algorithms are run in parallel. A hybrid score is computed by weighting and summing the scores from the two algorithms.
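
To make the term sensitivity of BM25 concrete, here is a minimal from-scratch sketch of the scoring formula. It is an illustration under stated assumptions, not the Rank-BM25 implementation used in this notebook: it uses the smoothed "+1" IDF variant, which may differ from the IDF used by that package, and the parameter values `k1=1.5` and `b=0.75` are common choices assumed here. The function name `bm25_scores` is hypothetical.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, corpus_tokens, k1=1.5, b=0.75):
    """Score every document in corpus_tokens against query_tokens with BM25."""
    n_docs = len(corpus_tokens)
    avg_len = sum(len(doc) for doc in corpus_tokens) / n_docs
    # Document frequency: number of documents containing each term
    doc_freq = Counter(term for doc in corpus_tokens for term in set(doc))
    scores = []
    for doc in corpus_tokens:
        term_freq = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in term_freq:
                continue
            # Smoothed IDF (the "+ 1" variant, which is never negative)
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            tf = term_freq[term]
            # Longer-than-average documents are penalized through b
            length_norm = 1 - b + b * len(doc) / avg_len
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores.append(score)
    return scores

corpus = [
    "soldier returns home war".split(),
    "detective hunts killer city".split(),
    "soldier falls love nurse war hospital".split(),
]
print(bm25_scores("soldier war".split(), corpus))
```

Only documents that share at least one token with the query receive a nonzero score, which is why BM25 is sensitive to specific terms but blind to synonyms.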

Since sparse embedding of a large corpus can be slow and memory intensive, we focus here on using it for reranking. Specifically we will do the following.  
1. Start with the top 100 most similar plots to the query as determined by the dense search using FAISS.
2. Sparse embed the plot summaries of the 100 movies using the BM25 algorithm.
3. Embed the query plot using BM25.
4. Score the similarity between the query and the 100 movie plots.
5. Combine the scores from BM25 embedding with the cross-encoder scores computed in the foregoing section.    
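
The final combination step can be sketched in plain Python as follows. This is a minimal illustration with made-up scores, assuming z-score standardization with the population standard deviation (the same convention `StandardScaler` uses) followed by a weighted sum. The function names and the example values are hypothetical.

```python
import statistics

def standardize(values):
    """Z-score standardization: zero mean and unit (population) standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]

def hybrid_scores(dense, sparse, sparse_weight=0.5):
    """Weighted sum of standardized sparse and dense scores."""
    dense_z = standardize(dense)
    sparse_z = standardize(sparse)
    return [sparse_weight * s + (1 - sparse_weight) * d
            for d, s in zip(dense_z, sparse_z)]

# Hypothetical scores for four candidate documents, on very different scales
dense = [7.2, 3.1, -4.0, 0.5]
sparse = [0.10, 0.90, 0.05, 0.40]

scores = hybrid_scores(dense, sparse, sparse_weight=0.5)
# Indices of the documents ordered from best to worst hybrid score
ranking = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
print(ranking)
```

Without standardization, the dense scores in this example would dominate the sum simply because their numeric range is larger.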

### Tokenize and sparse embed movie plots     

We will now apply the BM25 algorithm to sparse embed the text of the movie plots. There are two steps.        
1. The BM25 algorithm operates on tokenized text. The code in the cell below applies a simple tokenization to the text, splitting it into words.      
2. The tokenized text is sparse encoded using the [Okapi BM25 algorithm](https://en.wikipedia.org/wiki/Okapi_BM25). The code uses the `BM25Okapi` class from the [Rank-BM25 package](https://github.com/dorianbrown/rank_bm25).

Execute this code.      

In [None]:
def tokenize_text(text_list):
  out = []
  for string in text_list:
    out.append(word_tokenize(string))
  return out


## Tokenize the plot text
tokenized_corpus = tokenize_text(list(movie_plots_train.loc[dense_distance.index,'Plot']))

## Sparse encode the tokenized documents
bm25 = BM25Okapi(tokenized_corpus)


### Query the BM25 sparse coding        

With the sparse inverted index constructed we can now perform efficient queries. The `bm25_query` function does the following.      
1. Tokenizes the query text.     
2. Computes the score of the query against each of the embedded plots.   
3. Min-max normalizes the scores to the range $[0, 1]$.    

Execute this code and examine the results.  

In [None]:
def bm25_query(model, query_text):
  tokenized_query = word_tokenize(query_text)
#  tokenized_query = query_text.split(" ")
  scores = model.get_scores(tokenized_query)
  scores =(scores - np.min(scores)) / (np.max(scores) - np.min(scores))
  return pd.Series(scores)

scores = bm25_query(bm25, movie_plots_test.loc[0,'Plot'])
plot_dists(scores.sort_values(ascending=False), ylabel='Score', title='Score vs. rank')

### Compute the hybrid score    

We will now compute the hybrid approximate similarity score, integrating the sparse embedding and dense embedding scores.    

The first step is to standardize the two scores. Standardizing the scores is critical since one does not want differences in their numeric ranges to bias the integrated result. Execute the code to apply a scaler to the two score vectors.  

In [None]:
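# results was re-sorted by cross-encoder score, so align the BM25 scores
# (computed in the original search order) with the rows by index label
results['BM25_scores'] = pd.Series(scores.to_numpy(), index=dense_distance.index)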

cols_to_scale = ['Score','BM25_scores']
scaler = StandardScaler()
results[cols_to_scale] = scaler.fit_transform(results[cols_to_scale])

results[['Score','BM25_scores']]

With the scores standardized, they can now be integrated into a hybrid score. Here we use a simple approach, an equal weight sum of the standardized scores. Execute the code and examine the results.  

In [None]:
sparse_weight = 0.5

integrated_scores = np.add(sparse_weight * results['BM25_scores'], (1 - sparse_weight) * results['Score'])
hybrid_scores = integrated_scores.sort_values(ascending=False)
hybrid_scores

Finally, you can examine the top similarity matches by executing the code in the cell below.   

In [None]:
## Sort the results data frame by descending hybrid score
# Assign the Series directly so pandas aligns the scores with the rows by index label
results['hybrid_scores'] = hybrid_scores
results.sort_values(by='hybrid_scores', ascending=False, inplace=True)
plot_dists(results.loc[:,'hybrid_scores'].sort_values(ascending=False), ylabel='Score', title='Hybrid score vs. rank')
results.iloc[:10,:]

> **Exercise 5-06:** Examine the results computed with the hybrid reranker and answer these questions.
> 1. Comparing the scores for the dense cross-encoder, the sparse embedding and the hybrid of the two, what can you say about the independence of these scores? How does the independence, or not, of the scores affect the independence of the hybrid score?
> 2. Compare the curves of the BM25 scores and the hybrid score. What does the difference tell you about the effectiveness of the hybrid score for ranking?        
> 3. Consider the trade-off hyperparameter `sparse_weight`. How will increasing or decreasing this hyperparameter change the resulting hybrid score?
> 4. Earlier we applied preprocessing to the movie plot text to a) set all characters to lower case, b) remove numbers, c) remove stop words, and d) remove punctuation. How do these preprocessing steps help to improve the similarity scores computed using the sparse embeddings?
> 5. An alternative to using the BM25 algorithm is to create sparse embeddings with a more sophisticated algorithm like SPLADE. What are the advantages and disadvantages of such an approach compared to the one applied here?     

> **Answers:**
> 1.         
> 2.    
> 3.        
> 4.        
> 5.   

## Adding Additional Features     

Up to now, we have only dealt with a single attribute of a movie, a plot summary, for similarity search. However, there are other attributes of the movie generally available in the dataset, including the release year, origin, director, cast and genre. Adding embeddings of other movie attributes can lead to better similarity matches. Such an approach is common in solutions such as [**content-based recommender systems**](https://en.wikipedia.org/wiki/Recommender_system).       

### Exploring the variables

To get started, let's examine the data types of other movie attribute variables. Execute the code in the cell below and examine the results.

In [None]:
movie_plots.dtypes


With the exception of the release year, all of the other attributes are text strings.    

To further explore these attributes, let's look at a sample of these attributes. Execute the code in the cell below and examine the results.    

In [None]:
for col in ['Release Year','Origin/Ethnicity', 'Director', 'Cast','Genre']:
  print('\n\n' + col)
  print(movie_plots.loc[:20,col])

Notice that these attributes are coded as short text strings, such as names or key words. There are also a number of missing values.       

The next question to ask is, what is the cardinality of these attributes? To find out, execute the code in the cell below and examine the results.  

In [None]:
for col in ['Origin/Ethnicity', 'Director', 'Cast','Genre']:
  print('\n\n' + col)
  print(len(movie_plots.loc[:,col].unique()))

Several of these attributes have high cardinality.    

The Origin/Ethnicity variable has low cardinality, but what about case (class) imbalance? To find out, execute the code in the cell below and examine the results.   

In [None]:
movie_plots['counts'] = 1
movie_plots.loc[:,['counts', 'Origin/Ethnicity']].groupby(['Origin/Ethnicity']).agg('count')

Notice that several of these classes have very few cases, whereas others have many cases.     

### Identifying and treating missing values

To continue the exploration of the data we will now explore missing values. One issue that must be dealt with is that the coding of missing values is inconsistent between the different attribute variables.       

To start the exploration of missing values execute the code in the cell below and examine the results.

In [None]:
for col, missing in zip(['Director', 'Genre'], ['Unknown', 'unknown']):
  print('\n' + col)
  print(len(movie_plots.loc[movie_plots.loc[:,col] == missing,col]))

To continue the exploration of missing values, execute the code in the cell below.   

In [None]:
print(sum(movie_plots.loc[:,'Cast'].isna()))

In order to process these unstructured variables we need to use a consistent coding for the missing values. The code in the cell below codes missing values in two variables to the string 'unknown'.

In [None]:
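movie_plots_train.loc[movie_plots_train.loc[:,'Cast'].isna(),'Cast'] = 'unknown'
movie_plots_train.loc[movie_plots_train.loc[:,'Director'] == 'Unknown','Director'] = 'unknown'
# Apply the same coding to the test set so the concatenated strings contain no NaN values
movie_plots_test.loc[movie_plots_test.loc[:,'Cast'].isna(),'Cast'] = 'unknown'
movie_plots_test.loc[movie_plots_test.loc[:,'Director'] == 'Unknown','Director'] = 'unknown'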

### Embed the features      

With the attribute variables explored and prepared we are now ready to create an embedding. There are several reasonable paths we can take at this point:     
1. **Embed each variable independently** and then apply appropriate weights to combine the results. This approach has the advantage of flexibility at the cost of speed and complexity.     
2. **Embed the concatenated text strings**. The text strings of these features are concatenated and the result is then embedded. This approach lacks flexibility but is simpler and faster, so we will adopt it here.      
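
For comparison, the first option, which is not used in this notebook, can be sketched as follows. One possible way to combine independently embedded variables is to scale each unit-length vector by the square root of its weight and concatenate. The dot product of two combined vectors then equals the weighted sum of the per-variable cosine similarities. The random vectors below stand in for real embeddings, and the weights and function names are hypothetical.

```python
import math
import random

def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def combine(field_vectors, weights):
    """Concatenate unit-length field vectors, each scaled by sqrt(weight)."""
    combined = []
    for vec, w in zip(field_vectors, weights):
        combined.extend(math.sqrt(w) * x for x in normalize(vec))
    return combined

random.seed(2)

def fake_embedding(dim=8):
    # Stand-in for a real embedding model
    return [random.gauss(0, 1) for _ in range(dim)]

weights = [0.6, 0.3, 0.1]   # hypothetical weights for plot, cast, genre
movie_a = [fake_embedding() for _ in weights]
movie_b = [fake_embedding() for _ in weights]

combined_similarity = dot(combine(movie_a, weights), combine(movie_b, weights))
weighted_cosines = sum(w * dot(normalize(a), normalize(b))
                       for w, a, b in zip(weights, movie_a, movie_b))
print(combined_similarity, weighted_cosines)
```

The two printed values agree up to floating point error. The flexibility comes from the weights, which can be tuned without recomputing any embeddings.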

To create the concatenated text strings, execute the code in the cell below.  

In [None]:
space_vector_train = np.array([' '] * movie_plots_train.shape[0])
space_vector_test = np.array([' '] * movie_plots_test.shape[0])

# Release Year is numeric, so convert it to strings element-wise with astype(str).
# Calling to_string() here would append one long string of all years to every row.
movie_plots_train['meta_information'] = movie_plots_train.loc[:,'Director'] + \
                                  space_vector_train + \
                                  movie_plots_train.loc[:,'Cast'] + \
                                  space_vector_train + \
                                  movie_plots_train.loc[:,'Origin/Ethnicity'] + \
                                  space_vector_train + \
                                  movie_plots_train.loc[:,'Genre'] + \
                                  space_vector_train + \
                                  movie_plots_train.loc[:,'Release Year'].astype(str)

movie_plots_test['meta_information'] = movie_plots_test.loc[:,'Director'] + \
                                  space_vector_test + \
                                  movie_plots_test.loc[:,'Cast'] + \
                                  space_vector_test + \
                                  movie_plots_test.loc[:,'Origin/Ethnicity'] + \
                                  space_vector_test + \
                                  movie_plots_test.loc[:,'Genre'] + \
                                  space_vector_test + \
                                  movie_plots_test.loc[:,'Release Year'].astype(str)


movie_plots_test.loc[:10,'meta_information']

### Compute the attribute embeddings

We are now ready to embed the concatenated movie meta-attributes. To proceed, execute the code in the cell below.        

In [None]:
model_meta =  SentenceTransformer(
    sentence_model,
    similarity_fn_name='cosine',
    device=device
)

%time embeddings_meta_train = model_meta.encode(movie_plots_train.loc[:,'meta_information'].to_numpy())
%time embeddings_meta_test = model_meta.encode(movie_plots_test.loc[:10,'meta_information'].to_numpy())

print(embeddings_meta_train.shape)
print(embeddings_meta_test.shape)

The next step is to concatenate the embedding of the plot summaries and the embeddings of the movie attributes into a single embedding vector.

In [None]:
# Concatenate along the feature axis (axis=1) so each row holds both embeddings
concatenated_embeddings_train = np.concatenate((embeddings_train, embeddings_meta_train), axis=1)
concatenated_embeddings_test = np.concatenate((embeddings_test, embeddings_meta_test), axis=1)

print(concatenated_embeddings_train.shape)
print(concatenated_embeddings_test.shape)

The embeddings are now of dimension 768.    

### Perform the similarity search     

With the embeddings created, the index can now be constructed from the training embeddings. Execute the code in the cell below to build the index. Execution will take a significant amount of time.

In [None]:
# L2 normalize the training vectors so that ranking by L2 distance matches ranking by cosine similarity
faiss.normalize_L2(concatenated_embeddings_train)

# Build the index
D=concatenated_embeddings_train.shape[1]
m = 64
nbits = 8
nlist = 4096
index_IVFL2_PQ = build_index(D, m, nbits, nlist)

# Train the index and add the vectors
%time index_IVFL2_PQ.train(concatenated_embeddings_train)
index_IVFL2_PQ.add(concatenated_embeddings_train)
print(f'Index size: {get_memory(index_IVFL2_PQ)} MB')
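
The reason for the normalization can be checked numerically. For unit-length vectors, the squared L2 distance equals 2 minus twice the cosine similarity, so the nearest neighbor by L2 distance is also the most similar by cosine. The sketch below uses only NumPy, with random vectors standing in for the embeddings; it does not use the FAISS index, and the array sizes are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)
vectors = rng.normal(size=(100, 16))   # stand-in for the training embeddings
query = rng.normal(size=16)            # stand-in for a query embedding

# Scale every vector to unit length, as faiss.normalize_L2 does in place
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)
query /= np.linalg.norm(query)

# Cosine similarity and squared L2 distance between the query and each vector
cosine = vectors @ query
sq_l2 = np.sum((vectors - query) ** 2, axis=1)

# Smallest distance first versus largest cosine similarity first
order_l2 = np.argsort(sq_l2)
order_cos = np.argsort(-cosine)

print(np.allclose(sq_l2, 2 - 2 * cosine))
print(order_l2[:5], order_cos[:5])
```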

The code in the cell below performs a query using the additional attribute features. Execute the code in the cell below.  

In [None]:
faiss.normalize_L2(concatenated_embeddings_test)

k=100
index_IVFL2_index = faiss.extract_index_ivf(index_IVFL2_PQ)
index_IVFL2_index.nprobe = 128
%time dist, indx = index_IVFL2_PQ.search(concatenated_embeddings_test[0,:].reshape(1,D), k=k)


# FAISS returns row positions, so select by position; copy to allow adding columns
results = movie_plots_train.iloc[indx[0]].loc[:, ['Title', 'Plot']].copy()
results['Distance'] = dist[0]
print_movies_distances(results.iloc[:10,:])


> **Exercise 5-07:**  Now, answer these questions.
> 1. As explained, we do not have independent ranking information for this data set, but we can make some qualitative conjectures about the quality of the ranking. Read the plot summaries and compare them to the plot summary of the query movie. Briefly describe the quality or general agreement of plot content between the query and the ranked responses.
> 2. Briefly explain how the additional attributes, such as genre, could have led to better responses to the queries.
> 3. Briefly explain how the missing or 'unknown' values in the attributes create ambiguity and reduce the recall of the query.    

> **Answers:**
> 1.         
> 2.      
> 3.          

The next step is to perform the reranking using the cross encoder. Execute the code in the cell below to perform the reranking and display the results.     

In [None]:
def compute_scores(model, query, targets):
  # Score all (query, target) pairs in one batched call to the cross encoder
  pairs = [(query, string) for string in targets]
  return model.predict(pairs)

pretrained_model = "cross-encoder/ms-marco-MiniLM-L6-v2"
model_CrossEncoder = CrossEncoder(pretrained_model)


query_index = movie_plots_test.index[0]
scores = compute_scores(model_CrossEncoder, movie_plots_test.loc[query_index,'Plot'], results.loc[:,'Plot'])
scores[:10]

# Sort the results data frame by descending score
results['Score'] = scores
results.sort_values(by='Score', ascending=False, inplace=True)
plot_dists(results.loc[:,'Score'], ylabel='Score', title='Reranked score vs. rank')
results.iloc[:10,:]
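
One way to summarize how much the reranking reorders the candidates is a rank correlation between the two orderings. The sketch below computes Spearman's rank correlation with NumPy on made-up distances and scores; the function name `spearman_rho` and the numbers are illustrative assumptions, and the formula used assumes there are no tied values.

```python
import numpy as np

def spearman_rho(x, y):
    """Spearman rank correlation for 1-D arrays without tied values."""
    # argsort of argsort gives the rank of each element (0 = smallest)
    rank_x = np.argsort(np.argsort(x))
    rank_y = np.argsort(np.argsort(y))
    n = len(x)
    d = rank_x - rank_y
    return 1.0 - 6.0 * np.sum(d ** 2) / (n * (n ** 2 - 1))

# Made-up values: a small distance and a large score both mean "more similar"
distances = np.array([0.10, 0.20, 0.30, 0.40, 0.50])
scores = np.array([4.0, 5.0, 1.0, 2.0, -3.0])

# Negate the scores so both sequences run from best to worst in the same direction
rho = spearman_rho(distances, -scores)
print(round(rho, 2))
```

A value near 1 would suggest that the cross encoder largely agrees with the ordering from the approximate search, while a value near 0 would suggest substantial reordering.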

> **Exercise 5-08:** Examine the distances and scores in the results and answer these questions.
> 1. Briefly explain what the respective ordering, or lack thereof, tells you about the effect of the reranking.
> 2. Briefly explain what the shape of the curve and range of score values tell you about the reliability of the approximate nearest neighbor search results and the rankings.
> 3. We have investigated using additional embedded attributes to improve approximate nearest neighbor search. Another possible way to improve search results is to filter the raw database by one or more categorical attributes. For example, in the case of movies, the genre might be one way to do so. If you were to implement such a filter, what type of index or data structure would you use for efficient filtering at scale? What is a computationally efficient approach to filter on keywords before doing the nearest neighbor search?

> **Answers:**      
> 1.             
> 2.          
> 3.         

#### Copyright 2025, Stephen F. Elston. All rights reserved.