<h1>Chapter 8 - Semantic Search and Retrieval-Augmented Generation</h1>
<i>Exploring a vital part of LLMs, search.</i>

<a href="https://www.amazon.com/Hands-Large-Language-Models-Understanding/dp/1098150961"><img src="https://img.shields.io/badge/Buy%20the%20Book!-grey?logo=amazon"></a>
<a href="https://www.oreilly.com/library/view/hands-on-large-language/9781098150952/"><img src="https://img.shields.io/badge/O'Reilly-white.svg?logo=data:image/svg%2bxml;base64,PHN2ZyB3aWR0aD0iMzQiIGhlaWdodD0iMjciIHZpZXdCb3g9IjAgMCAzNCAyNyIgZmlsbD0ibm9uZSIgeG1sbnM9Imh0dHA6Ly93d3cudzMub3JnLzIwMDAvc3ZnIj4KPGNpcmNsZSBjeD0iMTMiIGN5PSIxNCIgcj0iMTEiIHN0cm9rZT0iI0Q0MDEwMSIgc3Ryb2tlLXdpZHRoPSI0Ii8+CjxjaXJjbGUgY3g9IjMwLjUiIGN5PSIzLjUiIHI9IjMuNSIgZmlsbD0iI0Q0MDEwMSIvPgo8L3N2Zz4K"></a>
<a href="https://github.com/HandsOnLLM/Hands-On-Large-Language-Models"><img src="https://img.shields.io/badge/GitHub%20Repository-black?logo=github"></a>
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/HandsOnLLM/Hands-On-Large-Language-Models/blob/main/chapter08/Chapter%208%20-%20Semantic%20Search.ipynb)



### [OPTIONAL] - Installing Packages on <img src="https://colab.google/static/images/icons/colab.png" width=100>

If you are viewing this notebook on Google Colab (or any other cloud vendor), you need to **uncomment and run** the following code block to install the dependencies for this chapter:

---

💡 **NOTE**: We will want to use a GPU to run the examples in this notebook. In Google Colab, go to
**Runtime > Change runtime type > Hardware accelerator > GPU > GPU type > T4**.

---


In [2]:
# %%capture
!pip install langchain==0.2.5 faiss-cpu==1.8.0 cohere==5.5.8 langchain-community==0.2.5 rank_bm25==0.2.2 sentence-transformers==3.0.1 pandas python-dotenv
!pip install llama-cpp-python==0.2.78  --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cu124

## IMPORTANT: Make sure to restart the session after installing the packages above.

Looking in indexes: https://pypi.org/simple, https://abetlen.github.io/llama-cpp-python/whl/cu124


## Setup

In [3]:
import cohere
import os
from dotenv import load_dotenv

load_dotenv()

# Get API key from environment variable
api_key = os.environ.get('COHERE_API_KEY')

# Create and retrieve a Cohere API key from os.cohere.ai
co = cohere.Client(api_key)

In [4]:
text = """
Interstellar is a 2014 epic science fiction film co-written, directed, and produced by Christopher Nolan.
It stars Matthew McConaughey, Anne Hathaway, Jessica Chastain, Bill Irwin, Ellen Burstyn, Matt Damon, and Michael Caine.
Set in a dystopian future where humanity is struggling to survive, the film follows a group of astronauts who travel through a wormhole near Saturn in search of a new home for mankind.

Brothers Christopher and Jonathan Nolan wrote the screenplay, which had its origins in a script Jonathan developed in 2007.
Caltech theoretical physicist and 2017 Nobel laureate in Physics[4] Kip Thorne was an executive producer, acted as a scientific consultant, and wrote a tie-in book, The Science of Interstellar.
Cinematographer Hoyte van Hoytema shot it on 35 mm movie film in the Panavision anamorphic format and IMAX 70 mm.
Principal photography began in late 2013 and took place in Alberta, Iceland, and Los Angeles.
Interstellar uses extensive practical and miniature effects and the company Double Negative created additional digital effects.

Interstellar premiered on October 26, 2014, in Los Angeles.
In the United States, it was first released on film stock, expanding to venues using digital projectors.
The film had a worldwide gross over $677 million (and $773 million with subsequent re-releases), making it the tenth-highest grossing film of 2014.
It received acclaim for its performances, direction, screenplay, musical score, visual effects, ambition, themes, and emotional weight.
It has also received praise from many astronomers for its scientific accuracy and portrayal of theoretical astrophysics. Since its premiere, Interstellar gained a cult following,[5] and now is regarded by many sci-fi experts as one of the best science-fiction films of all time.
Interstellar was nominated for five awards at the 87th Academy Awards, winning Best Visual Effects, and received numerous other accolades"""

# Split into a list of sentences
texts = text.split('.')

# Clean up to remove empty spaces and new lines
texts = [t.strip(' \n') for t in texts]

In [5]:
import numpy as np

# Get the embeddings
response = co.embed(
  texts=texts,
  input_type="search_document", # embed the texts as documents to be stored and searched in a vector index
).embeddings

embeds = np.array(response)
print(embeds.shape)

(15, 4096)


In [6]:
import faiss

dim = embeds.shape[1]
index = faiss.IndexFlatL2(dim) # Exact (brute-force) index; search returns squared L2 (Euclidean) distances
index.add(np.float32(embeds))
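
As a rough mental model of what a flat L2 index does at query time, the sketch below compares a query against every stored vector and keeps the closest ones. It is a plain-Python illustration, not FAISS's implementation; the names `squared_l2`, `flat_l2_search`, and the toy vectors are hypothetical. It uses squared distances because that is what `IndexFlatL2` reports.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def flat_l2_search(vectors, query, k):
    """Score the query against every stored vector and return the k closest."""
    scored = sorted((squared_l2(v, query), i) for i, v in enumerate(vectors))
    top = scored[:k]
    distances = [d for d, _ in top]
    ids = [i for _, i in top]
    return distances, ids

# Hypothetical 2-dimensional "embeddings", for illustration only
toy_vectors = [[0.0, 0.0], [1.0, 0.0], [3.0, 4.0]]
toy_distances, toy_ids = flat_l2_search(toy_vectors, query=[1.0, 1.0], k=2)
print(toy_ids, toy_distances)
```

Because every vector is compared with the query, this kind of search is exact but scales linearly with the number of stored vectors, which is fine for the 15 sentences used here.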

In [7]:
import pandas as pd

def search(query, number_of_results=5):

  # 1. Get the query's embedding
  query_embed = co.embed(texts=[query],
                input_type="search_query",).embeddings[0]

  # 2. Retrieve the nearest neighbors
  distances, similar_item_ids = index.search(np.float32([query_embed]), number_of_results)

  # 3. Format the results
  texts_np = np.array(texts) # Convert texts list to numpy for easier indexing
  results = pd.DataFrame(data={'texts': texts_np[similar_item_ids[0]],
                              'distance': distances[0]})

  # 4. Print and return the results
  print(f"Query:'{query}'\nNearest neighbors:")
  return results

In [12]:
from rank_bm25 import BM25Okapi
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
import string
from tqdm import tqdm

def bm25_tokenizer(text):
    # Lowercase, split on whitespace, strip punctuation, and drop English stop words
    tokenized_doc = []
    for token in text.lower().split():
        token = token.strip(string.punctuation)

        if len(token) > 0 and token not in ENGLISH_STOP_WORDS:
            tokenized_doc.append(token)
    return tokenized_doc

# Build BM25 index
tokenized_corpus = []
for passage in tqdm(texts):
    tokenized_corpus.append(bm25_tokenizer(passage))

bm25 = BM25Okapi(tokenized_corpus)

def keyword_search(query, top_k=3, num_candidates=15):
    print("Input question:", query)

    ##### BM25 search (lexical search) #####
    bm25_scores = bm25.get_scores(bm25_tokenizer(query))
    top_n = np.argpartition(bm25_scores, -num_candidates)[-num_candidates:]
    bm25_hits = [{'corpus_id': idx, 'score': bm25_scores[idx]} for idx in top_n]
    bm25_hits = sorted(bm25_hits, key=lambda x: x['score'], reverse=True)

    print(f"Top-{top_k} lexical search (BM25) hits")
    for hit in bm25_hits[0:top_k]:
        print("\t{:.3f}\t{}".format(hit['score'], texts[hit['corpus_id']].replace("\n", " ")))

100%|██████████| 15/15 [00:00<00:00, 87502.87it/s]


In [11]:
def keyword_and_reranking_search(query, top_k=3, num_candidates=10):
    print("Input question:", query)

    ##### BM25 search (lexical search) #####
    bm25_scores = bm25.get_scores(bm25_tokenizer(query))
    top_n = np.argpartition(bm25_scores, -num_candidates)[-num_candidates:]
    bm25_hits = [{'corpus_id': idx, 'score': bm25_scores[idx]} for idx in top_n]
    bm25_hits = sorted(bm25_hits, key=lambda x: x['score'], reverse=True)

    print(f"Top-{top_k} lexical search (BM25) hits")
    for hit in bm25_hits[0:top_k]:
        print("\t{:.3f}\t{}".format(hit['score'], texts[hit['corpus_id']].replace("\n", " ")))

    # Add re-ranking
    docs = [texts[hit['corpus_id']] for hit in bm25_hits]

    print(f"\nTop-{top_k} hits by rank-API ({len(bm25_hits)} BM25 hits re-ranked)")
    results = co.rerank(query=query, documents=docs, top_n=top_k, return_documents=True)
    for hit in results.results:
        print("\t{:.3f}\t{}".format(hit.relevance_score, hit.document.text.replace("\n", " ")))

---

## Easy Tasks

Short exercises for practicing the concepts from Chapter 8.

### Task 1: Testing Different Queries with Semantic Search

**Objective:** Experiment with different types of queries to understand how semantic search works.

**Instructions:**
1. Use the `search()` function with at least 3 different queries
2. Try queries about:
   - Awards/nominations
   - Filming locations
   - Actors
3. Observe the distance scores returned for each query

In [9]:
# Task 1: Your code here
# Test at least 3 different queries

# Example query 1: About awards
query1 = "awards and nominations"
results1 = search(query1, number_of_results=3)
print("\nQuery 1 Results:")
print(results1)

# TODO: Add your own queries below
# query2 = 
# query3 = 

Query:'awards and nominations'
Nearest neighbors:

Query 1 Results:
                                               texts      distance
0  Interstellar was nominated for five awards at ...   8574.546875
1  Caltech theoretical physicist and 2017 Nobel l...  10175.265625
2  It received acclaim for its performances, dire...  11402.931641


**Question:** Why do some queries return better results than others?

### Task 2: Compare Dense Retrieval vs BM25

**Objective:** Compare the results from semantic search (dense retrieval) and keyword search (BM25).

**Instructions:**
1. Choose a query and run it with both methods:
   - `search(query)` for dense retrieval
   - `keyword_search(query)` for BM25
2. Compare the top 3 results from each method
3. Are they the same? Different? Why?
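
One way to make the comparison in step 3 concrete is to measure how much the two top-k lists overlap. The helper below is a minimal sketch: `compare_rankings` and the example ids are hypothetical, and it assumes you have collected the sentence ids (or the sentence texts) of each method's top results yourself, since `keyword_search()` as defined above only prints its hits.

```python
def compare_rankings(dense_top, keyword_top):
    """Summarize which results two ranked lists share and where they differ."""
    dense_set, keyword_set = set(dense_top), set(keyword_top)
    shared = dense_set & keyword_set
    union = dense_set | keyword_set
    return {
        "shared": sorted(shared),
        "only_dense": sorted(dense_set - keyword_set),
        "only_keyword": sorted(keyword_set - dense_set),
        # Jaccard overlap: 1.0 means identical result sets, 0.0 means disjoint
        "jaccard": len(shared) / len(union) if union else 1.0,
    }

# Hypothetical sentence ids returned by each method, for illustration only
comparison = compare_rankings(dense_top=[3, 7, 1], keyword_top=[7, 2, 3])
print(comparison)
```

A low overlap is not necessarily a problem: it often indicates that the query wording differs from the wording in the documents, which is where the two methods tend to diverge.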

**Example queries to try:**
- "Christopher Nolan director"
- "box office revenue"
- "visual effects"

In [16]:
# Task 2: Your code here
# Compare dense retrieval vs BM25

query = "box office revenue"

print("For Dense Retrieval (Semantic Search):")
dense_results = search(query, number_of_results=3)
print(dense_results)

For Dense Retrieval (Semantic Search):
Query:'box office revenue'
Nearest neighbors:
                                               texts      distance
0  The film had a worldwide gross over $677 milli...   7515.812012
1  In the United States, it was first released on...  11027.736328
2  It received acclaim for its performances, dire...  11393.539062


In [17]:
print("For BM25 (Keyword Search):")
keyword_search(query, top_k=3)

For BM25 (Keyword Search):
Input question: box office revenue
Top-3 lexical search (BM25) hits
	0.000	Interstellar is a 2014 epic science fiction film co-written, directed, and produced by Christopher Nolan
	0.000	It stars Matthew McConaughey, Anne Hathaway, Jessica Chastain, Bill Irwin, Ellen Burstyn, Matt Damon, and Michael Caine
	0.000	Set in a dystopian future where humanity is struggling to survive, the film follows a group of astronauts who travel through a wormhole near Saturn in search of a new home for mankind
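
All three hits above score 0.000, most likely because none of the words "box", "office", or "revenue" appears in any sentence of the corpus: BM25 only rewards exact token matches, so when every score ties at zero the listed "top" hits carry no ranking information. The sketch below checks which query terms occur in a set of documents at all. It is a simplified stand-in for `bm25_tokenizer` (no stop-word filtering), and `simple_tokens`, `matched_query_terms`, and `sample_docs` are hypothetical names used only for illustration.

```python
import string

def simple_tokens(text):
    """Lowercase, split on whitespace, and strip surrounding punctuation."""
    tokens = (t.strip(string.punctuation) for t in text.lower().split())
    return {t for t in tokens if t}

def matched_query_terms(query, documents):
    """Return the query tokens that occur in at least one document."""
    corpus_vocab = set()
    for doc in documents:
        corpus_vocab |= simple_tokens(doc)
    return simple_tokens(query) & corpus_vocab

# Two sentences taken from the Interstellar text used in this notebook
sample_docs = [
    "The film had a worldwide gross over $677 million (and $773 million with "
    "subsequent re-releases), making it the tenth-highest grossing film of 2014",
    "Interstellar premiered on October 26, 2014, in Los Angeles",
]

print(matched_query_terms("box office revenue", sample_docs))
print(matched_query_terms("worldwide gross", sample_docs))
```

Dense retrieval found the box-office sentence for the same query because it matches on meaning rather than on shared tokens.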


In [19]:
# TODO: Try with different queries and see the differences

### Task 3: Understanding Distance Metrics

**Objective:** Analyze how distance scores indicate relevance in semantic search.

**Instructions:**
1. Run a search query and examine the distance values
2. What does a smaller distance mean?
3. What does a larger distance mean?
4. Find the query that gives you the lowest average distance (most relevant results)

**Hint:** With the L2 distance metric, a lower distance means the result is more similar (and therefore more relevant) to the query.

In [20]:
# Task 3: Your code here
# Analyze distance metrics

# Test multiple queries and compare average distances
queries_to_test = [
    "science fiction film",
    "quantum physics",  # Not relevant to Interstellar content
    "Christopher Nolan"
]

for q in queries_to_test:
    results = search(q, number_of_results=3)
    avg_distance = results['distance'].mean()
    print(f"\nQuery: '{q}'")
    print(f"Average distance: {avg_distance:.4f}")
    print(f"Distance range: {results['distance'].min():.4f} to {results['distance'].max():.4f}")

# TODO: What patterns do you notice about relevant vs irrelevant queries?

Query:'science fiction film'
Nearest neighbors:

Query: 'science fiction film'
Average distance: 7619.0942
Distance range: 6799.8867 to 8464.7129
Query:'quantum physics'
Nearest neighbors:

Query: 'quantum physics'
Average distance: 12993.4141
Distance range: 12195.9004 to 13420.5547
Query:'Christopher Nolan'
Nearest neighbors:

Query: 'Christopher Nolan'
Average distance: 9192.8877
Distance range: 8420.4180 to 9918.2998


### Task 4: Observe the Impact of Reranking

**Objective:** See how reranking improves search results.

**Instructions:**
1. Use `keyword_and_reranking_search()` with different queries
2. Compare the BM25 scores before reranking with relevance scores after reranking
3. Does the order change? By how much?
4. Which document gets the highest relevance score?
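
BM25 scores and reranker relevance scores are on different scales, so for step 3 it is usually easier to compare positions than raw values. The helper below is a minimal sketch of that idea: `rank_changes` and the document labels are hypothetical, and it assumes you have recorded the order of the documents before and after reranking yourself.

```python
def rank_changes(before, after):
    """For each item in `after`, report how many positions it moved up (+) or down (-)."""
    old_pos = {item: pos for pos, item in enumerate(before)}
    changes = {}
    for new_pos, item in enumerate(after):
        # None marks an item that was not present in the original ordering
        changes[item] = old_pos[item] - new_pos if item in old_pos else None
    return changes

# Hypothetical orderings, for illustration only
bm25_order = ["doc_a", "doc_b", "doc_c", "doc_d"]
reranked_order = ["doc_c", "doc_a", "doc_b"]

moves = rank_changes(bm25_order, reranked_order)
print(moves)
```

A positive value means the reranker promoted the document relative to BM25, and a negative value means it was demoted.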

**Example queries:**
- "scientific accuracy physics"
- "movie release date"
- "actors cast members"

In [22]:
# Task 4: Your code here
# Observe reranking impact

query = "scientific accuracy physics"
print(f"Testing reranking with query: '{query}'\n")
keyword_and_reranking_search(query, top_k=3, num_candidates=10)


# TODO: Test with other queries

Testing reranking with query: 'scientific accuracy physics'

Input question: scientific accuracy physics
Top-3 lexical search (BM25) hits
	4.733	It has also received praise from many astronomers for its scientific accuracy and portrayal of theoretical astrophysics
	1.373	Caltech theoretical physicist and 2017 Nobel laureate in Physics[4] Kip Thorne was an executive producer, acted as a scientific consultant, and wrote a tie-in book, The Science of Interstellar
	0.000	In the United States, it was first released on film stock, expanding to venues using digital projectors

Top-3 hits by rank-API (10 BM25 hits re-ranked)
	0.590	It has also received praise from many astronomers for its scientific accuracy and portrayal of theoretical astrophysics
	0.064	Caltech theoretical physicist and 2017 Nobel laureate in Physics[4] Kip Thorne was an executive producer, acted as a scientific consultant, and wrote a tie-in book, The Science of Interstellar
	0.022	Interstellar is a 2014 epic science ficti