# CHAPTER 08 - Semantic Search and Retrieval-Augmented Generation

## Dense Retrieval

In [1]:
from dotenv import load_dotenv
import os
import numpy as np
import pandas as pd
from tqdm import tqdm
import cohere

load_dotenv()

True

In [2]:
co = cohere.Client(os.environ.get('COHERE_API_KEY'))

### Getting the text archive and chunking it

In [3]:
text = """
Interstellar is a 2014 epic science fiction film co-written, directed, and pro
duced by Christopher Nolan. 
It stars Matthew McConaughey, Anne Hathaway, Jessica Chastain, Bill Irwin, 
Ellen Burstyn, Matt Damon, and Michael Caine. 
Set in a dystopian future where humanity is struggling to survive, the film 
follows a group of astronauts who travel through a wormhole near Saturn in 
search of a new home for mankind.

Brothers Christopher and Jonathan Nolan wrote the screenplay, which had its 
origins in a script Jonathan developed in 2007. 
Caltech theoretical physicist and 2017 Nobel laureate in Physics[4] Kip Thorne 
was an executive producer, acted as a scientific consultant, and wrote a tie-in 
book, The Science of Interstellar. 
Cinematographer Hoyte van Hoytema shot it on 35 mm movie film in the Panavision 
anamorphic format and IMAX 70 mm. 
Principal photography began in late 2013 and took place in Alberta, Iceland, 
and Los Angeles. 
Interstellar uses extensive practical and miniature effects and the company 
Double Negative created additional digital effects.

Interstellar premiered on October 26, 2014, in Los Angeles. 
In the United States, it was first released on film stock, expanding to venues 
using digital projectors. 
The film had a worldwide gross over $677 million (and $773 million with subse
quent re-releases), making it the tenth-highest grossing film of 2014. 
It received acclaim for its performances, direction, screenplay, musical score, 
visual effects, ambition, themes, and emotional weight. 
It has also received praise from many astronomers for its scientific accuracy 
and portrayal of theoretical astrophysics. Since its premiere, Interstellar 
gained a cult following,[5] and now is regarded by many sci-fi experts as one 
of the best science-fiction films of all time.
Interstellar was nominated for five awards at the 87th Academy Awards, winning 
Best Visual Effects, and received numerous other accolades
"""

In [4]:
# Split into a list of sentences
texts = text.split('.')
texts

['\nInterstellar is a 2014 epic science fiction film co-written, directed, and pro\nduced by Christopher Nolan',
 ' \nIt stars Matthew McConaughey, Anne Hathaway, Jessica Chastain, Bill Irwin, \nEllen Burstyn, Matt Damon, and Michael Caine',
 ' \nSet in a dystopian future where humanity is struggling to survive, the film \nfollows a group of astronauts who travel through a wormhole near Saturn in \nsearch of a new home for mankind',
 '\n\nBrothers Christopher and Jonathan Nolan wrote the screenplay, which had its \norigins in a script Jonathan developed in 2007',
 ' \nCaltech theoretical physicist and 2017 Nobel laureate in Physics[4] Kip Thorne \nwas an executive producer, acted as a scientific consultant, and wrote a tie-in \nbook, The Science of Interstellar',
 ' \nCinematographer Hoyte van Hoytema shot it on 35 mm movie film in the Panavision \nanamorphic format and IMAX 70 mm',
 ' \nPrincipal photography began in late 2013 and took place in Alberta, Iceland, \nand Los Angeles',
 '

In [5]:
# Clean up to remove empty spaces and new lines
texts = [t.strip(' \n') for t in texts]
texts = [t.replace('\n', '') for t in texts]
texts

['Interstellar is a 2014 epic science fiction film co-written, directed, and produced by Christopher Nolan',
 'It stars Matthew McConaughey, Anne Hathaway, Jessica Chastain, Bill Irwin, Ellen Burstyn, Matt Damon, and Michael Caine',
 'Set in a dystopian future where humanity is struggling to survive, the film follows a group of astronauts who travel through a wormhole near Saturn in search of a new home for mankind',
 'Brothers Christopher and Jonathan Nolan wrote the screenplay, which had its origins in a script Jonathan developed in 2007',
 'Caltech theoretical physicist and 2017 Nobel laureate in Physics[4] Kip Thorne was an executive producer, acted as a scientific consultant, and wrote a tie-in book, The Science of Interstellar',
 'Cinematographer Hoyte van Hoytema shot it on 35 mm movie film in the Panavision anamorphic format and IMAX 70 mm',
 'Principal photography began in late 2013 and took place in Alberta, Iceland, and Los Angeles',
 'Interstellar uses extensive practical a

### Embedding the text chunks

In [6]:
# Get the embeddings
response = co.embed(
    texts=texts,
    input_type="search_document"
).embeddings

embeds = np.array(response)

In [7]:
embeds

array([[ 0.21228027, -1.2519531 ,  1.2949219 , ...,  1.5058594 ,
         0.88671875, -0.79052734],
       [ 2.3847656 , -0.5361328 ,  0.4560547 , ...,  0.8027344 ,
         0.1430664 , -0.21972656],
       [ 1.6035156 , -1.0830078 ,  1.0039062 , ...,  0.32128906,
        -1.4453125 , -0.3137207 ],
       ...,
       [ 2.0058594 , -0.42797852,  3.0253906 , ...,  0.84375   ,
        -1.9521484 , -0.23461914],
       [ 2.6191406 , -1.4667969 ,  1.8798828 , ...,  0.8066406 ,
        -0.19580078, -1.0117188 ],
       [ 0.5527344 , -2.6953125 ,  1.1621094 , ..., -1.7802734 ,
        -1.7197266 , -2.1289062 ]], shape=(15, 4096))

In [8]:
print(embeds.shape)

(15, 4096)


### Building the search index

In [9]:
import faiss

In [10]:
dim = embeds.shape[1]
index = faiss.IndexFlatL2(dim)
print(index.is_trained)

True


In [11]:
index.add(np.float32(embeds))

### Search the index

In [12]:
def search(query, number_of_results=3):

    # 1. Get the query's embedding
    query_embed = co.embed(
        texts=[query],
        input_type="search_document"
    ).embeddings[0]

    # 2. Retrieve the nearest neighbors
    distances, similar_item_ids = index.search(
        np.float32([query_embed]),
        number_of_results
    )

    # 3. Format the results
    texts_np = np.array(texts) # Convert texts list to numpy for easier indexing
    results = pd.DataFrame(
        data={
            'texts': texts_np[similar_item_ids[0]],
            'distance': distances[0]
        }
    )

    # 4. Print and return the results
    print(f"Query: '{query}'\nNearest neighbors:")
    return results

In [13]:
# Write a query and search the texts
query = "How precise was the science"
results = search(query)
results

Query: 'How precise was the science'
Nearest neighbors:


Unnamed: 0,texts,distance
0,It has also received praise from many astronom...,10738.859375
1,Interstellar uses extensive practical and mini...,11887.107422
2,Cinematographer Hoyte van Hoytema shot it on 3...,12191.457031


In [14]:
from rank_bm25 import BM25Okapi
from sklearn.feature_extraction import _stop_words
import string

In [19]:
def bm25_tokenizer(text: str):
    tokenized_doc = []
    for token in text.lower().split():
        token = token.strip(string.punctuation)

        if len(token) > 0 and token not in _stop_words.ENGLISH_STOP_WORDS:
            tokenized_doc.append(token)
    return tokenized_doc

In [21]:
tokenized_corpus = []
for passage in tqdm(texts):
    tokenized_corpus.append(bm25_tokenizer(passage))

bm25 = BM25Okapi(tokenized_corpus)
bm25

100%|██████████| 15/15 [00:00<?, ?it/s]


<rank_bm25.BM25Okapi at 0x1e87b6e6f30>

In [26]:
def keyword_search(query, top_k=3, num_candidates=15):
    print(f"Input question: {query}")

    ####### BM25 search (lexical search) #######
    bm25_scores = bm25.get_scores(bm25_tokenizer(query))
    top_n = np.argpartition(bm25_scores, -num_candidates)[-num_candidates:]
    bm25_hits = [{'corpus_id': idx, 'score': bm25_scores[idx]} for idx in top_n]
    bm25_hits = sorted(bm25_hits, key=lambda x: x['score'], reverse=True)

    print("Top-3 lexical search (BM25) hits")
    for hit in bm25_hits[0: top_k]:
        print(f"\t{hit['score']:.3f}\t{texts[hit['corpus_id']]}")

In [27]:
keyword_search(query="how precise was the science")

Input question: how precise was the science
Top-3 lexical search (BM25) hits
	1.789	Interstellar is a 2014 epic science fiction film co-written, directed, and produced by Christopher Nolan
	1.373	Caltech theoretical physicist and 2017 Nobel laureate in Physics[4] Kip Thorne was an executive producer, acted as a scientific consultant, and wrote a tie-in book, The Science of Interstellar
	0.000	It stars Matthew McConaughey, Anne Hathaway, Jessica Chastain, Bill Irwin, Ellen Burstyn, Matt Damon, and Michael Caine


## Reranking

In [29]:
query = "how precise was the science"
results = co.rerank(
    query=query,
    documents=texts,
    top_n=3,
    return_documents=True
)

for idx, result in enumerate(results.results):
    print(f"{idx} | {result.relevance_score} | {result.document.text}")

0 | 0.16981852 | It has also received praise from many astronomers for its scientific accuracy and portrayal of theoretical astrophysics
1 | 0.07004896 | The film had a worldwide gross over $677 million (and $773 million with subsequent re-releases), making it the tenth-highest grossing film of 2014
2 | 0.0043994132 | Caltech theoretical physicist and 2017 Nobel laureate in Physics[4] Kip Thorne was an executive producer, acted as a scientific consultant, and wrote a tie-in book, The Science of Interstellar


## Retrieval-Augmented Generation (RAG)

### Example #1: Grounded Generation with an LLM API

In [30]:
query = "income generated"

# 1 - Retrieval
# We'll use embedding search. But ideally we'd do hybrid
results = search(query)

Query: 'income generated'
Nearest neighbors:


In [32]:
# 2 - Grounded Generation
docs_dict = [{'text': text} for text in results['texts']]
response = co.chat(
    message=query,
    documents=docs_dict
)

In [33]:
print(response.text)

The film grossed over $677 million worldwide, and $773 million with subsequent re-releases.


In [35]:
print(response.documents)

[{'id': 'doc_0', 'text': 'The film had a worldwide gross over $677 million (and $773 million with subsequent re-releases), making it the tenth-highest grossing film of 2014'}]


### Example #2: RAG with local models