<h1>Chapter 8 - Semantic Search and Retrieval-Augmented Generation</h1>
<i>Exploring a vital part of LLMs, search.</i>

<a href="https://www.amazon.com/Hands-Large-Language-Models-Understanding/dp/1098150961"><img src="https://img.shields.io/badge/Buy%20the%20Book!-grey?logo=amazon"></a>
<a href="https://www.oreilly.com/library/view/hands-on-large-language/9781098150952/"><img src="https://img.shields.io/badge/O'Reilly-white.svg?logo=data:image/svg%2bxml;base64,PHN2ZyB3aWR0aD0iMzQiIGhlaWdodD0iMjciIHZpZXdCb3g9IjAgMCAzNCAyNyIgZmlsbD0ibm9uZSIgeG1sbnM9Imh0dHA6Ly93d3cudzMub3JnLzIwMDAvc3ZnIj4KPGNpcmNsZSBjeD0iMTMiIGN5PSIxNCIgcj0iMTEiIHN0cm9rZT0iI0Q0MDEwMSIgc3Ryb2tlLXdpZHRoPSI0Ii8+CjxjaXJjbGUgY3g9IjMwLjUiIGN5PSIzLjUiIHI9IjMuNSIgZmlsbD0iI0Q0MDEwMSIvPgo8L3N2Zz4K"></a>
<a href="https://github.com/HandsOnLLM/Hands-On-Large-Language-Models"><img src="https://img.shields.io/badge/GitHub%20Repository-black?logo=github"></a>
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/HandsOnLLM/Hands-On-Large-Language-Models/blob/main/chapter08/Chapter%208%20-%20Semantic%20Search.ipynb)

---

This notebook is for Chapter 8 of the [Hands-On Large Language Models](https://www.amazon.com/Hands-Large-Language-Models-Understanding/dp/1098150961) book by [Jay Alammar](https://www.linkedin.com/in/jalammar) and [Maarten Grootendorst](https://www.linkedin.com/in/mgrootendorst/).

---

<a href="https://www.amazon.com/Hands-Large-Language-Models-Understanding/dp/1098150961">
<img src="https://raw.githubusercontent.com/HandsOnLLM/Hands-On-Large-Language-Models/main/images/book_cover.png" width="350"/></a>


### [OPTIONAL] - Installing Packages on <img src="https://colab.google/static/images/icons/colab.png" width=100>

If you are viewing this notebook on Google Colab (or any other cloud vendor), you need to **uncomment and run** the following code block to install the dependencies for this chapter:

---

💡 **NOTE**: We will want to use a GPU to run the examples in this notebook. In Google Colab, go to
**Runtime > Change runtime type > Hardware accelerator > GPU > GPU type > T4**.

---


In [4]:
# # %%capture
# !pip install langchain==0.2.5 faiss-cpu==1.8.0 cohere==5.5.8 langchain-community==0.2.5 rank_bm25==0.2.2 sentence-transformers==3.0.1
# !pip install llama-cpp-python==0.2.78  --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cu124

# # IMPORTANT: Make sure to restart the session after installing the packages above.

# Dense Retrieval Example


## 1. Getting the text archive and chunking it


In [5]:
import cohere

# Create and retrieve a Cohere API key from os.cohere.ai
# Paste your API key here. Remember not to share it publicly
api_key = ''
co = cohere.Client(api_key)

In [6]:
text = """
Interstellar is a 2014 epic science fiction film co-written, directed, and produced by Christopher Nolan.
It stars Matthew McConaughey, Anne Hathaway, Jessica Chastain, Bill Irwin, Ellen Burstyn, Matt Damon, and Michael Caine.
Set in a dystopian future where humanity is struggling to survive, the film follows a group of astronauts who travel through a wormhole near Saturn in search of a new home for mankind.

Brothers Christopher and Jonathan Nolan wrote the screenplay, which had its origins in a script Jonathan developed in 2007.
Caltech theoretical physicist and 2017 Nobel laureate in Physics[4] Kip Thorne was an executive producer, acted as a scientific consultant, and wrote a tie-in book, The Science of Interstellar.
Cinematographer Hoyte van Hoytema shot it on 35 mm movie film in the Panavision anamorphic format and IMAX 70 mm.
Principal photography began in late 2013 and took place in Alberta, Iceland, and Los Angeles.
Interstellar uses extensive practical and miniature effects and the company Double Negative created additional digital effects.

Interstellar premiered on October 26, 2014, in Los Angeles.
In the United States, it was first released on film stock, expanding to venues using digital projectors.
The film had a worldwide gross over $677 million (and $773 million with subsequent re-releases), making it the tenth-highest grossing film of 2014.
It received acclaim for its performances, direction, screenplay, musical score, visual effects, ambition, themes, and emotional weight.
It has also received praise from many astronomers for its scientific accuracy and portrayal of theoretical astrophysics. Since its premiere, Interstellar gained a cult following,[5] and now is regarded by many sci-fi experts as one of the best science-fiction films of all time.
Interstellar was nominated for five awards at the 87th Academy Awards, winning Best Visual Effects, and received numerous other accolades"""

# Split into a list of sentences
texts = text.split('.')

# Clean up to remove empty spaces and new lines
texts = [t.strip(' \n') for t in texts]
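Splitting on every `.` works for this passage, but it would also cut decimals and leave an empty chunk when the text ends with a period. The following is a minimal sketch of a slightly more defensive splitter, assuming sentences end with `.`, `!` or `?` followed by whitespace. The function name `split_sentences` and the sample string are illustrative only and are not part of the chapter's code.

```python
import re

def split_sentences(text):
    """Split after sentence-ending punctuation that is followed by whitespace."""
    # The lookbehind keeps "2.0" intact because its period is not followed by whitespace
    parts = re.split(r'(?<=[.!?])\s+', text.strip())
    # Drop empty pieces and the trailing period, mirroring the period-free chunks above
    return [p.strip().rstrip('.') for p in parts if p.strip()]

# Made-up sample text, used only to exercise the function
sample = "Version 2.0 shipped today. It fixes two bugs.\nUsers should upgrade."
chunks = split_sentences(sample)
print(chunks)
```

For real documents, a dedicated sentence tokenizer or fixed-size chunks with overlap may be a better fit than any regex.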

## 2. Embedding the Text Chunks


In [7]:
import numpy as np

# Get the embeddings
response = co.embed(
  texts=texts,
  input_type="search_document",
).embeddings

# Transform it to a numpy array
embeds = np.array(response)
print(embeds.shape)

(15, 4096)


In [8]:
len(texts)

15

## 3. Building The Search Index


FAISS (Facebook AI Similarity Search) is a library for efficient similarity search and clustering of dense vectors.

It’s often used to build vector databases for semantic search, retrieval, and nearest-neighbor lookups.

In [9]:
import faiss

# Get the embedding dimension
dim = embeds.shape[1]
# Create a flat index that stores vectors in RAM and uses L2 (Euclidean) distance for similarity search
index = faiss.IndexFlatL2(dim)
# Convert the embeddings to float32 (the type FAISS requires) and add them to the index
index.add(np.float32(embeds))
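To make the flat index less of a black box, here is a small NumPy sketch of what an exhaustive L2 search does: compare the query against every stored vector and keep the closest ones. It is an illustration, not FAISS's implementation. The helper `flat_l2_search` and the toy vectors are hypothetical. FAISS's flat L2 index reports squared Euclidean distances, so the sketch skips the square root as well.

```python
import numpy as np

def flat_l2_search(vectors, query, k):
    """Exhaustive nearest-neighbor search by squared Euclidean distance."""
    diffs = vectors - query              # broadcast the query against every stored vector
    dists = np.sum(diffs ** 2, axis=1)   # squared L2 distance per stored vector
    order = np.argsort(dists)[:k]        # indices of the k smallest distances
    return dists[order], order

# Toy 2-D vectors standing in for real embeddings
toy_vectors = np.array([[0.0, 0.0], [1.0, 0.0], [3.0, 4.0]], dtype=np.float32)
toy_query = np.array([1.0, 1.0], dtype=np.float32)

toy_dists, toy_ids = flat_l2_search(toy_vectors, toy_query, k=2)
print(toy_ids, toy_dists)
```

Smaller distances mean closer matches, so results come back sorted in ascending order of distance.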

## 4. Search the index


In [10]:
# Try to find the sentences that are most similar to the query

In [11]:
import pandas as pd

def search(query, number_of_results=3):

  # 1. Get the query's embedding
  query_embed = co.embed(
      texts=[query],
      input_type="search_query",
  ).embeddings[0]

  # 2. Retrieve the nearest neighbors
  distances, similar_item_ids = index.search(np.float32([query_embed]), number_of_results)

  # 3. Format the results
  texts_np = np.array(texts) # Convert texts list to numpy for easier indexing
  results = pd.DataFrame(data={'texts': texts_np[similar_item_ids[0]],
                              'distance': distances[0]})

  # 4. Print and return the results
  print(f"Query:'{query}'\nNearest neighbors:")
  return results

In [12]:
query = "how precise was the science"
results = search(query)
results

Query:'how precise was the science'
Nearest neighbors:


| | texts | distance |
|---|---|---|
| 0 | It has also received praise from many astronom... | 10757.371094 |
| 1 | Caltech theoretical physicist and 2017 Nobel l... | 11566.136719 |
| 2 | Interstellar uses extensive practical and mini... | 11922.841797 |


In [13]:
results.head().style

| | texts | distance |
|---|---|---|
| 0 | It has also received praise from many astronomers for its scientific accuracy and portrayal of theoretical astrophysics | 10757.371094 |
| 1 | Caltech theoretical physicist and 2017 Nobel laureate in Physics[4] Kip Thorne was an executive producer, acted as a scientific consultant, and wrote a tie-in book, The Science of Interstellar | 11566.136719 |
| 2 | Interstellar uses extensive practical and miniature effects and the company Double Negative created additional digital effects | 11922.841797 |


# Retrieval-Augmented Generation

## Example: Grounded Generation with an LLM API


In [14]:
results

| | texts | distance |
|---|---|---|
| 0 | It has also received praise from many astronom... | 10757.371094 |
| 1 | Caltech theoretical physicist and 2017 Nobel l... | 11566.136719 |
| 2 | Interstellar uses extensive practical and mini... | 11922.841797 |


In [15]:
query = "income generated" # Define the user's query

# 1- Retrieval
# We'll use embedding search here, though ideally we'd use hybrid (keyword + embedding) search
results = search(query) # Perform semantic search to retrieve relevant text chunks

# 2- Grounded Generation
docs_dict = [{'text': text} for text in results['texts']] # Format retrieved texts for Cohere API
response = co.chat( # Call the Cohere chat API
    message = query, # Pass the user's query
    documents=docs_dict # Provide the retrieved documents for grounding
)

print(response.text) # Print the generated response

Query:'income generated'
Nearest neighbors:
The film Interstellar generated a worldwide gross of over $677 million, and $773 million with subsequent re-releases.
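The retrieval comment above mentions that hybrid search would be ideal. One common way to combine a keyword ranking with an embedding ranking is reciprocal rank fusion (RRF). This technique is brought in here for illustration and is not the method used in this notebook. The sketch below assumes each retriever returns an ordered list of document ids. The ids are hypothetical, and `k=60` is a commonly used smoothing constant.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into a single ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical ranked ids from a keyword search (e.g. BM25) and from a dense search
keyword_ranking = [10, 4, 7]
dense_ranking = [10, 7, 2]

fused = reciprocal_rank_fusion([keyword_ranking, dense_ranking])
print(fused)
```

Documents ranked highly by both retrievers rise to the top, while a document found by only one retriever still stays in the candidate list.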


In [16]:
docs_dict

[{'text': 'The film had a worldwide gross over $677 million (and $773 million with subsequent re-releases), making it the tenth-highest grossing film of 2014'},
 {'text': 'Caltech theoretical physicist and 2017 Nobel laureate in Physics[4] Kip Thorne was an executive producer, acted as a scientific consultant, and wrote a tie-in book, The Science of Interstellar'},
 {'text': 'Interstellar uses extensive practical and miniature effects and the company Double Negative created additional digital effects'}]

In [17]:
response



In [18]:
response.citations

[ChatCitation(start=9, end=21, text='Interstellar', document_ids=['doc_2'], type='TEXT_CONTENT'),
 ChatCitation(start=34, end=70, text='worldwide gross of over $677 million', document_ids=['doc_0'], type='TEXT_CONTENT'),
 ChatCitation(start=76, end=117, text='$773 million with subsequent re-releases.', document_ids=['doc_0'], type='TEXT_CONTENT')]
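Each citation carries character offsets into the generated text, which makes it possible to show readers where each claim came from. Below is a minimal sketch that appends source markers after each cited span. The `Citation` dataclass is a hypothetical stand-in that only mirrors the `start`, `end`, and `document_ids` fields visible in the output above. It is not the Cohere class itself.

```python
from dataclasses import dataclass, field

@dataclass
class Citation:
    """Hypothetical stand-in mirroring the fields shown in the citation output."""
    start: int
    end: int
    document_ids: list = field(default_factory=list)

def annotate(text, citations):
    """Insert [doc_id] markers right after each cited span."""
    # Work from the end of the string so earlier offsets stay valid
    for c in sorted(citations, key=lambda c: c.end, reverse=True):
        marker = "".join(f"[{d}]" for d in c.document_ids)
        text = text[:c.end] + marker + text[c.end:]
    return text

answer = ("The film Interstellar generated a worldwide gross of over $677 million, "
          "and $773 million with subsequent re-releases.")
cites = [
    Citation(9, 21, ["doc_2"]),
    Citation(34, 70, ["doc_0"]),
    Citation(76, 117, ["doc_0"]),
]

annotated = annotate(answer, cites)
print(annotated)
```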

## Example: RAG with Local Models


### Loading the Generation Model


In [19]:
!wget https://huggingface.co/microsoft/Phi-3-mini-4k-instruct-gguf/resolve/main/Phi-3-mini-4k-instruct-q4.gguf


In [20]:
from langchain_community.llms import LlamaCpp # Import the LlamaCpp class for interacting with local GGUF models

llm = LlamaCpp(
    model_path="Phi-3-mini-4k-instruct-q4.gguf", # Path to the local GGUF model file
    n_gpu_layers=-1,  # Number of layers to offload to the GPU (-1 means all layers)
    max_tokens=500,  # Maximum number of tokens to generate in the response
    n_ctx=2048,      # Context window size (number of tokens the model considers)
    seed=42,         # Random seed for reproducibility
    verbose=False    # Whether to display verbose output from llama.cpp
)
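With retrieval in the loop, the context window is shared between the prompt (template, retrieved chunks, and question) and the generated answer. The sketch below is a rough budgeting aid that assumes the two simply add up to `n_ctx`. The helper name `prompt_token_budget` is hypothetical.

```python
def prompt_token_budget(n_ctx, max_tokens):
    """Tokens left for the prompt once room for the generated answer is reserved."""
    if max_tokens >= n_ctx:
        raise ValueError("max_tokens must be smaller than n_ctx")
    return n_ctx - max_tokens

# Settings used for the model above
budget = prompt_token_budget(n_ctx=2048, max_tokens=500)
print(budget)
```

If the retrieved chunks are long, this budget limits how many of them can be placed in the prompt.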

### Loading the Embedding Model

In [21]:
from langchain_community.embeddings import HuggingFaceEmbeddings

# Embedding Model for converting text to numerical representations
embedding_model = HuggingFaceEmbeddings(
    model_name='BAAI/bge-small-en-v1.5'
)

### Preparing the Vector Database

In [24]:
from langchain_community.vectorstores import FAISS

# Create a local vector database using FAISS
# It takes the text chunks and the embedding model to generate and store embeddings
db = FAISS.from_texts(texts, embedding_model)

### The RAG Prompt


In [None]:
from langchain.prompts import PromptTemplate # Import PromptTemplate for creating custom prompts
from langchain.chains import RetrievalQA # Import RetrievalQA for building the RAG chain


# Create a prompt template
# This template defines how the retrieved context and the user question will be formatted for the LLM
template = """<|user|>
Relevant information:
{context}

Provide a concise answer to the following question using the relevant information provided above:
{question}<|end|>
<|assistant|>"""
prompt = PromptTemplate(
    template=template,
    input_variables=["context", "question"] # Specify the input variables for the template
)

# RAG Pipeline
# RetrievalQA chain combines a retriever (to get relevant documents) and an LLM (to generate the answer)
rag = RetrievalQA.from_chain_type(
    llm=llm, # The language model to use for generation
    chain_type='stuff', # 'stuff' chain type takes all retrieved documents and stuffs them into the prompt
    retriever=db.as_retriever(), # The retriever to fetch relevant documents from the vector database
    chain_type_kwargs={
        "prompt": prompt # Pass the custom prompt template to the chain
    },
    verbose=True # Set to True to see the steps the chain is taking
)
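Conceptually, the `'stuff'` chain type concatenates every retrieved chunk into one context string and fills the prompt template with it. The plain-Python sketch below illustrates that idea and is not LangChain's implementation. The helper `stuff_prompt`, the shortened template, the placeholder chunks, and the blank-line separator are all assumptions made for this example.

```python
def stuff_prompt(template, docs, question, separator="\n\n"):
    """Join all retrieved chunks into one context string and fill the template."""
    context = separator.join(docs)
    return template.format(context=context, question=question)

# Shortened, hypothetical template with the same two input variables
sketch_template = "Relevant information:\n{context}\n\nQuestion: {question}"
retrieved = ["Chunk A", "Chunk B"]  # placeholders for retrieved text chunks

filled = stuff_prompt(sketch_template, retrieved, "Income generated")
print(filled)
```

Because every chunk goes into a single prompt, this approach only works while the combined text fits in the model's context window.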

In [None]:
rag.invoke('Income generated')



> Entering new RetrievalQA chain...

> Finished chain.


{'query': 'Income generated',
 'result': " Interstellar grossed over $677 million worldwide in 2014 and had additional earnings from subsequent re-releases, totaling approximately $773 million. The film's release utilized both traditional film stock and digital projectors across various venues to maximize its income generation potential."}