# 🧭 **Introduction to Vector Search in Book Recommender App**

In this section, we use **LangChain** to implement vector search for our **book recommender app**. The goal is to recommend books based on the semantic similarity of their descriptions. To do this, we rely on several packages from the **LangChain ecosystem**.

### 🛠️ **Dependencies and Workflow**

We’re importing several key components that guide the process of building a vector database. Here’s a brief overview of how each dependency fits into our workflow:

1. **TextLoader (from `langchain_community.document_loaders`)**:
   - **Role**: This class loads a raw text file into LangChain `Document` objects. For our book recommender, it reads the file of book descriptions so the text can be processed further.
   
2. **CharacterTextSplitter (from `langchain_text_splitters`)**:
   - **Role**: After loading the raw book descriptions, the `CharacterTextSplitter` class splits the text into smaller chunks at a chosen separator. In our case, each chunk will be a single book description. In other contexts, the same splitter can combine pieces up to a target character count, which makes it useful for organizing larger text data.
   
3. **OpenAIEmbeddings (from `langchain_openai`)**:
   - **Role**: The `OpenAIEmbeddings` class converts each chunk of text (a book description) into a vector embedding. An embedding is a numerical representation of the text that captures its semantic meaning, so descriptions with similar meaning end up with vectors that are close together. The embeddings are computed by **OpenAI's models** through the OpenAI API.
   
4. **Chroma (from `langchain_chroma`)**:
   - **Role**: Once we have the embeddings, we need to store them in a database for efficient retrieval. **Chroma** is a widely used open-source vector database that stores and manages the embeddings and supports efficient similarity search, which is the core operation of our recommendation system. LangChain integrates with Chroma and with other vector databases, so the storage layer can be swapped later if needed.

### 🔄 **Workflow Overview**
1. **Text Loading**: Load the raw book descriptions using the **TextLoader**.
2. **Text Splitting**: Break down the text into smaller chunks using the **CharacterTextSplitter**.
3. **Embedding Generation**: Convert each chunk into vector embeddings using **OpenAIEmbeddings**.
4. **Vector Storage**: Store the embeddings in a vector database (we will use **Chroma**) for fast similarity search.
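
To make "semantic similarity" concrete, here is a minimal sketch of how two embedding vectors can be compared using cosine similarity. The three-dimensional vectors are made-up toy values (real embeddings have far more dimensions), and `cosine_similarity` is a helper written for this illustration, not part of LangChain or Chroma. The distance metric a vector store actually uses depends on its configuration.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings" (hypothetical values, for illustration only)
query_vec = [0.9, 0.1, 0.0]
nature_book_vec = [0.8, 0.2, 0.1]
thriller_vec = [0.0, 0.2, 0.9]

# The vector pointing in a similar direction to the query scores higher
print(cosine_similarity(query_vec, nature_book_vec))
print(cosine_similarity(query_vec, thriller_vec))
```

A score close to 1 means the two vectors point in nearly the same direction, while a score near 0 means they are unrelated.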

In [15]:
from langchain_community.document_loaders import TextLoader
from langchain_text_splitters import CharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_chroma import Chroma

When working with sensitive information like **API keys**, keep it secure and never hard-code it into your codebase. The usual practice is to store such values as **environment variables** and read them at runtime. A convenient way to manage environment variables in Python is the `python-dotenv` package, which is imported as `dotenv`.

With `dotenv`, you store the variables in a `.env` file and call `load_dotenv()` to load them into your Python environment. This keeps secrets such as API keys out of the code and lets you change them without editing the notebook.

In [16]:
from dotenv import load_dotenv
load_dotenv()

True

In [22]:
import os
import pandas as pd
from tqdm import tqdm  # For progress bar


---

## **Reading the clean books dataset 📥**

In [None]:
books = pd.read_csv('datasets/books_cleaned.csv')
books

Unnamed: 0,isbn13,isbn10,title,authors,categories,thumbnail,description,published_year,average_rating,num_pages,ratings_count,short_description,title_and_subtitle,tagged_description
0,9780002005883,0002005883,Gilead,Marilynne Robinson,Fiction,http://books.google.com/books/content?id=KQZCP...,A NOVEL THAT READERS and critics have been eag...,2004.0,3.85,247.0,361.0,False,Gilead,9780002005883 A NOVEL THAT READERS and critics...
1,9780002261982,0002261987,Spider's Web,Charles Osborne;Agatha Christie,Detective and mystery stories,http://books.google.com/books/content?id=gA5GP...,A new 'Christie for Christmas' -- a full-lengt...,2000.0,3.83,241.0,5164.0,False,Spider's Web: A Novel,9780002261982 A new 'Christie for Christmas' -...
2,9780006178736,0006178731,Rage of angels,Sidney Sheldon,Fiction,http://books.google.com/books/content?id=FKo2T...,"A memorable, mesmerizing heroine Jennifer -- b...",1993.0,3.93,512.0,29532.0,False,Rage of angels,"9780006178736 A memorable, mesmerizing heroine..."
3,9780006280897,0006280897,The Four Loves,Clive Staples Lewis,Christian life,http://books.google.com/books/content?id=XhQ5X...,Lewis' work on the nature of love divides love...,2002.0,4.15,170.0,33684.0,False,The Four Loves,9780006280897 Lewis' work on the nature of lov...
4,9780006280934,0006280935,The Problem of Pain,Clive Staples Lewis,Christian life,http://books.google.com/books/content?id=Kk-uV...,"""In The Problem of Pain, C.S. Lewis, one of th...",2002.0,4.09,176.0,37569.0,False,The Problem of Pain,"9780006280934 ""In The Problem of Pain, C.S. Le..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5192,9788172235222,8172235224,Mistaken Identity,Nayantara Sahgal,Indic fiction (English),http://books.google.com/books/content?id=q-tKP...,On A Train Journey Home To North India After L...,2003.0,2.93,324.0,0.0,False,Mistaken Identity,9788172235222 On A Train Journey Home To North...
5193,9788173031014,8173031010,Journey to the East,Hermann Hesse,Adventure stories,http://books.google.com/books/content?id=rq6JP...,This book tells the tale of a man who goes on ...,2002.0,3.70,175.0,24.0,False,Journey to the East,9788173031014 This book tells the tale of a ma...
5194,9788179921623,817992162X,The Monk Who Sold His Ferrari: A Fable About F...,Robin Sharma,Health & Fitness,http://books.google.com/books/content?id=c_7mf...,"Wisdom to Create a Life of Passion, Purpose, a...",2003.0,3.82,198.0,1568.0,False,The Monk Who Sold His Ferrari: A Fable About F...,9788179921623 Wisdom to Create a Life of Passi...
5195,9788185300535,8185300534,I Am that,Sri Nisargadatta Maharaj;Sudhakar S. Dikshit,Philosophy,http://books.google.com/books/content?id=Fv_JP...,This collection of the timeless teachings of o...,1999.0,4.51,531.0,104.0,False,I Am that: Talks with Sri Nisargadatta Maharaj,9788185300535 This collection of the timeless ...


---

## **Tagging Descriptions with ISBN for Seamless Filtering 📚🔍**

Prefixing each book description with its **ISBN** gives every description a unique, reliable identifier, which simplifies filtering and lookup. Rather than matching search results back to books by comparing description strings, which is slow and error-prone, we can read the ISBN from the result and use it to find the exact book.

By using ISBN tags, we can:
- **Eliminate the need for string matching**, improving search speed and accuracy.
- **Ensure precision** in retrieving exact matches, avoiding ambiguities in descriptions.
- **Integrate easily** with external data sources like Goodreads or Amazon, enriching the data pipeline.

Because every stored description starts with its ISBN, each search result can be mapped back to the full book record with a simple lookup.

In [5]:
books['tagged_description'].head()

0    9780002005883 A NOVEL THAT READERS and critics...
1    9780002261982 A new 'Christie for Christmas' -...
2    9780006178736 A memorable, mesmerizing heroine...
3    9780006280897 Lewis' work on the nature of lov...
4    9780006280934 "In The Problem of Pain, C.S. Le...
Name: tagged_description, dtype: object

---

## **Saving Tagged Descriptions to a Text File for Use with TextLoader 📝**

The **TextLoader** class from LangChain is designed to work with raw text files, not directly with a **Pandas DataFrame**. Since the book descriptions are stored in the DataFrame, we need to extract and save them in a compatible format.

To bridge this gap, we need to:
1. **Extract tagged descriptions** (with ISBNs) from the DataFrame.
2. **Save them to a text file**, where each description is stored on a separate line.

This process ensures that the **TextLoader** can properly read the descriptions and work with the LangChain framework. Let's save our tagged descriptions and get them ready for the next steps in building the vector database.
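
One line per description only works if no description contains a line break of its own. The sketch below shows a quick sanity check one might run before saving. `find_multiline_entries` is a hypothetical helper, and the sample strings are invented for illustration.

```python
def find_multiline_entries(descriptions: list[str]) -> list[int]:
    """Return the positions of descriptions that contain an embedded line break."""
    return [
        i for i, text in enumerate(descriptions)
        if "\n" in text or "\r" in text
    ]

# Invented sample data: the second entry would span two lines on disk
sample = [
    "9780000000001 A single-line description.",
    "9780000000002 A description\nwith a line break.",
]
print(find_multiline_entries(sample))  # → [1]
```

In this notebook, the check could be called as `find_multiline_entries(books['tagged_description'].tolist())`; an empty list would mean every description fits on one line.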

In [7]:
books['tagged_description'].to_csv('datasets/tagged_descriptions.txt', 
                                   sep = '\n',
                                   index=False,
                                   header=False
                                   )

In [9]:
raw_documents = TextLoader('datasets/tagged_descriptions.txt').load()

## **Splitting the Text into Chunks ✂️**

Now that we have our tagged descriptions saved into a text file, the next step is **splitting** the text into manageable pieces using LangChain’s `CharacterTextSplitter`.

In [10]:
text_splitter = CharacterTextSplitter(chunk_size=0, chunk_overlap=0, separator='\n')
documents = text_splitter.split_documents(raw_documents)

Created a chunk of size 1168, which is longer than the specified 0
Created a chunk of size 1214, which is longer than the specified 0
Created a chunk of size 373, which is longer than the specified 0
Created a chunk of size 309, which is longer than the specified 0
Created a chunk of size 483, which is longer than the specified 0
Created a chunk of size 482, which is longer than the specified 0
Created a chunk of size 960, which is longer than the specified 0
Created a chunk of size 188, which is longer than the specified 0
Created a chunk of size 843, which is longer than the specified 0
Created a chunk of size 296, which is longer than the specified 0
Created a chunk of size 197, which is longer than the specified 0
Created a chunk of size 881, which is longer than the specified 0
Created a chunk of size 1088, which is longer than the specified 0
Created a chunk of size 1189, which is longer than the specified 0
Created a chunk of size 304, which is longer than the specified 0
Create

- **`separator='\n'`** tells the splitter to break the text **at each newline** — perfect, because we saved each book description on a new line.
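- **`chunk_size=0`** means **do not group multiple descriptions together**. Every description is longer than 0 characters, so the splitter never merges neighboring pieces and each description stays its own chunk. This is also why the splitter logs a "Created a chunk of size N, which is longer than the specified 0" warning for each description; the warnings are expected here.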
- **`chunk_overlap=0`** means **no overlap** between chunks; every description stays cleanly separated without repeating parts.

This setup keeps each description neatly isolated and ready for embedding. 🚀
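
The effect of this configuration can be approximated in plain Python: split on newlines and keep every non-empty line as its own chunk. This is only a sketch of the outcome, not LangChain's actual implementation. `split_one_per_line` is a hypothetical name and the sample text is invented.

```python
def split_one_per_line(text: str) -> list[str]:
    """Split text on newlines, dropping blank lines and surrounding whitespace."""
    return [line.strip() for line in text.split("\n") if line.strip()]

# Invented sample: two descriptions, one per line, with a trailing newline
raw_text = (
    "9780000000001 First description.\n"
    "9780000000002 Second description.\n"
)
chunks = split_one_per_line(raw_text)
print(len(chunks))  # → 2
```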

In [11]:
documents[0]

Document(metadata={'source': 'datasets/tagged_descriptions.txt'}, page_content='9780002005883 A NOVEL THAT READERS and critics have been eagerly anticipating for over a decade, Gilead is an astonishingly imagined story of remarkable lives. John Ames is a preacher, the son of a preacher and the grandson (both maternal and paternal) of preachers. It’s 1956 in Gilead, Iowa, towards the end of the Reverend Ames’s life, and he is absorbed in recording his family’s story, a legacy for the young son he will never see grow up. Haunted by his grandfather’s presence, John tells of the rift between his grandfather and his father: the elder, an angry visionary who fought for the abolitionist cause, and his son, an ardent pacifist. He is troubled, too, by his prodigal namesake, Jack (John Ames) Boughton, his best friend’s lost son who returns to Gilead searching for forgiveness and redemption. Told in John Ames’s joyous, rambling voice that finds beauty, humour and truth in the smallest of life’s d

---

## 🧠 Building Our Vector Database with Chroma

Now that we have our book descriptions split and ready, the next step is to **embed** them and **store** them in a **vector database**.  
For this, we're using **Chroma**, a lightweight, open-source vector database that's very popular for working with document embeddings in LangChain.

Chroma allows us to efficiently store and search embeddings locally — no need for external services — making it perfect for prototyping apps like our book recommender.

### 🔐 API Key Setup: Don't Forget the `.env` File

Since we're making API calls to OpenAI, we need to **authenticate** using our API key.

Create a `.env` file in your project root with the following content:

```
OPENAI_API_KEY=your_openai_api_key_here
```

And don't forget to **load the environment variables** with `load_dotenv()` before using `OpenAIEmbeddings`.

> 📦 `.env` keeps sensitive information secure and out of your main codebase.
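
A missing key otherwise tends to surface only when the first API call fails, so it can help to check for it up front. The following is a minimal sketch using only the standard library; `require_env` is a hypothetical helper, not part of `dotenv` or LangChain.

```python
import os

def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise a clear error."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(
            f"{name} is not set. Add it to your .env file and call load_dotenv() first."
        )
    return value

# Example usage (after load_dotenv() has run):
# api_key = require_env("OPENAI_API_KEY")
```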





In [24]:
# Create the persistence directory if it does not exist yet
os.makedirs('datasets/chroma_db', exist_ok=True)

In [27]:
embedding_model = OpenAIEmbeddings()

db_books = Chroma(
    collection_name='book_descriptions',
    embedding_function=embedding_model,
    persist_directory='datasets/chroma_db',
)

batch_size = 50

for i in tqdm(range(0, len(documents), batch_size)):
    batch = documents[i: i+batch_size]
    db_books.add_documents(batch)

100%|██████████| 104/104 [12:52<00:00,  7.43s/it]


In [35]:
db_books._collection.count() 

5197

### 🚨 Why Do We Need Batching?

When using OpenAI embedding models such as `text-embedding-ada-002`, the API enforces **rate limits** on how many requests and tokens you can send per minute.  
If we try to embed a large number of documents at once, we may hit a **RateLimitError**.

> ✅ **Batching** helps us process smaller, manageable chunks of data without exceeding API rate limits — making the upload safe, stable, and reliable.
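
The batching loop used above can be factored into a small reusable generator. This is a sketch using only the standard library; `iter_batches` is a hypothetical name, and integers stand in for the documents.

```python
from typing import Iterator, Sequence, TypeVar

T = TypeVar("T")

def iter_batches(items: Sequence[T], batch_size: int) -> Iterator[Sequence[T]]:
    """Yield consecutive slices of `items`, each with at most `batch_size` elements."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

# 5,197 items in batches of 50 gives 104 batches, matching the progress bar above
batches = list(iter_batches(list(range(5197)), 50))
print(len(batches))      # → 104
print(len(batches[-1]))  # → 47
```

If rate limits are still reached with small batches, one option is to pause briefly between batches, for example with `time.sleep`.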



### 💾 How Chroma Persists Data Locally

When you use Chroma with a `persist_directory` (like `datasets/chroma_db`), it saves your vector database **on disk** automatically.



In [31]:
root_dir = 'datasets/chroma_db/'

# Create a list to store file info
file_data = []

# Walk through the directory
for foldername, subfolders, filenames in os.walk(root_dir):
    for filename in filenames:
        file_path = os.path.join(foldername, filename)
        size_bytes = os.path.getsize(file_path)
        file_data.append({
            'File Name': filename,
            'File Path': file_path,
            'Size (KB)': round(size_bytes / 1024, 2),
        })

# Convert to a pandas DataFrame for a clean table
df_files = pd.DataFrame(file_data)

df_files

Unnamed: 0,File Name,File Path,Size (KB)
0,chroma.sqlite3,datasets/chroma_db/chroma.sqlite3,28792.0
1,data_level0.bin,datasets/chroma_db/add16ee0-97e5-49f5-b322-c16...,30683.59
2,length.bin,datasets/chroma_db/add16ee0-97e5-49f5-b322-c16...,19.53
3,link_lists.bin,datasets/chroma_db/add16ee0-97e5-49f5-b322-c16...,42.04
4,header.bin,datasets/chroma_db/add16ee0-97e5-49f5-b322-c16...,0.1
5,index_metadata.pickle,datasets/chroma_db/add16ee0-97e5-49f5-b322-c16...,281.26


#### 📦 Chroma DB Folder Contents

| File Name | Description |
|:---|:---|
| `chroma.sqlite3` | The main SQLite database that stores metadata about documents and embeddings. | 
| `data_level0.bin` | Binary file containing the raw vector embeddings stored efficiently for fast search. | 
| `length.bin` | Stores the lengths of each vector entry, used to efficiently index into `data_level0.bin`. | 
| `link_lists.bin` | Contains graph link information for ANN (Approximate Nearest Neighbors) search structure. | 
| `header.bin` | Small file that stores metadata about how the binary data is organized (dimensions, format). | 
| `index_metadata.pickle` | Pickle file that contains additional metadata about the vector index for Chroma's internal use. | 


---

## 📝 Reading the saved embeddings directly

Load the Chroma database from the specified `persist_directory` where it was saved. Also, initialize the embedding model you used to save the embeddings, such as `OpenAIEmbeddings`.

> **There is no need to rerun the previous section to create the embeddings in that case.**

In [39]:
embedding_model = OpenAIEmbeddings()

db_books = Chroma(
    collection_name='book_descriptions',
    embedding_function=embedding_model,
    persist_directory='datasets/chroma_db',
)

In [40]:
db_books._collection.count() 

5197

## 🔍 Querying Saved Embeddings

After saving embeddings into the Chroma database, you can query the stored data to find similar documents, make recommendations, or search for content related to a specific topic. Below are the steps to effectively query the Chroma database.

### 🔍 **Perform a Similarity Search**

Once the database is loaded, you can query it by providing a textual query. The database will return the most similar documents based on the stored embeddings. Here's how you can do a similarity search:

In [41]:
query = "A book to teach children about nature."
docs = db_books.similarity_search(query=query, k = 10)
docs

[Document(id='24990ed6-c688-4105-a936-7a25f579ad86', metadata={'source': 'datasets/tagged_descriptions.txt'}, page_content='9780786808069 Children will discover the exciting world of their own backyard in this introduction to familiar animals from cats and dogs to bugs and frogs. The combination of photographs, illustrations, and fun facts make this an accessible and delightful learning experience.'),
 Document(id='ace9001a-88eb-4f52-93b8-4ad34fe3f54c', metadata={'source': 'datasets/tagged_descriptions.txt'}, page_content="9780786808380 Introduce your babies to birds, cats, dogs, and babies through fine art, illustration, and photographs. These books are a rare opportunity to expose little ones to a range of images on a single subject, from simple child's drawings and abstract art to playful photos. A brief text accompanies each image, introducing the baby to some basic -- and sometimes playful -- information about the subjects."),
 Document(id='3465f6a9-86d9-485b-b544-722d51533935', m

### 📚 **Fetching ISBNs of Similar Documents**

After retrieving similar documents from Chroma, each document's content begins with the corresponding ISBN (since we tagged descriptions with ISBNs earlier). To fetch full book details like title, author, and genre, we can extract the ISBN from the retrieved document and query our original `books` DataFrame.

In [42]:
books[books['isbn13'] == int(docs[0].page_content.split()[0].strip())]

Unnamed: 0,isbn13,isbn10,title,authors,categories,thumbnail,description,published_year,average_rating,num_pages,ratings_count,short_description,title_and_subtitle,tagged_description
3747,9780786808069,786808063,Baby Einstein: Neighborhood Animals,Marilyn Singer;Julie Aigner-Clark,Juvenile Fiction,http://books.google.com/books/content?id=X9a4P...,Children will discover the exciting world of t...,2001.0,3.89,16.0,180.0,False,Baby Einstein: Neighborhood Animals,9780786808069 Children will discover the excit...


Here’s what happens:
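- `docs[0].page_content.split()[0].strip()` extracts the first whitespace-separated token (the ISBN) from the document content.
- We convert it to an integer to match the `isbn13` column in the `books` DataFrame.
- Finally, we filter the DataFrame to get full book details for the matching ISBN.
- Descriptions that contain double quotes may have been wrapped in quotes when they were written to the text file, so their first token can start with `"` and `int()` would fail. The function defined later in this section handles this by stripping the quote first.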
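
The extraction step can be made more defensive with a small helper that handles a leading quote and rejects content that does not start with a 13-digit ISBN. This is a sketch; `extract_isbn` is a hypothetical name, and the sample strings are shortened from descriptions shown in this notebook.

```python
def extract_isbn(page_content: str) -> int:
    """Parse the leading ISBN-13 from a tagged description."""
    # Remove a leading double quote left over from CSV-style quoting, if any
    tokens = page_content.lstrip('"').split()
    if not tokens or not (tokens[0].isdigit() and len(tokens[0]) == 13):
        raise ValueError(f"No leading ISBN-13 found in: {page_content[:30]!r}")
    return int(tokens[0])

print(extract_isbn("9780786808069 Children will discover the exciting world"))
# → 9780786808069
print(extract_isbn('"9780006280934 ""In The Problem of Pain'))
# → 9780006280934
```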

### 🔍 **Combining Querying Functionality into a Function**

To make our semantic search workflow cleaner and reusable, we wrap the querying and retrieval logic into a single function: `retrieve_semantic_recommendations`.

Here’s what this function does:
- Takes a **query** string (what the user is searching for) and a **top_k** (how many results to return).
- Uses the Chroma database to perform a **similarity search** based on the query.
- Extracts the **ISBNs** from the matched documents.
- Looks up the corresponding book details from the `books` DataFrame.
- Returns a DataFrame of the recommended books. Note that the rows come back in the order of the `books` DataFrame, not ranked by similarity.

Now, with just one line, you can generate smart, semantic book recommendations based on any user query! 📚✨

In [43]:
def retrieve_semantic_recommendations(
    query: str,
    top_k: int = 10
) -> pd.DataFrame:
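    # Retrieve the top_k most similar descriptions from Chroma
    recommendations = db_books.similarity_search(query=query, k=top_k)

    # Each description starts with its ISBN; strip any leading quote before parsing
    books_list = [
        int(rec.page_content.strip('"').split()[0])
        for rec in recommendations
    ]

    # isin() keeps the DataFrame's row order, not the similarity ranking
    return books[books['isbn13'].isin(books_list)]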
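
Filtering with `isin` returns rows in table order, so the similarity ranking is lost. The sketch below shows one way to keep the ranking, using plain dictionaries in place of DataFrame rows. `order_by_rank` is a hypothetical helper, and the ranking is assumed for illustration; the two records are taken from this notebook's outputs.

```python
def order_by_rank(records: list[dict], ranked_isbns: list[int]) -> list[dict]:
    """Return the records whose 'isbn13' is in ranked_isbns, in ranked order."""
    by_isbn = {record["isbn13"]: record for record in records}
    return [by_isbn[isbn] for isbn in ranked_isbns if isbn in by_isbn]

# Records in table order
records = [
    {"isbn13": 9780067575208, "title": "The Sense of Wonder"},
    {"isbn13": 9780786808069, "title": "Baby Einstein: Neighborhood Animals"},
]
# Assumed similarity ranking: best match first
ranked = [9780786808069, 9780067575208]

print([r["title"] for r in order_by_rank(records, ranked)])
# → ['Baby Einstein: Neighborhood Animals', 'The Sense of Wonder']
```

With pandas, a comparable approach might be to set `isbn13` as the index and select rows with `.loc` using the ranked list of ISBNs.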

In [38]:
retrieve_semantic_recommendations(
    'A book to teach children about nature',
    10
)

Unnamed: 0,isbn13,isbn10,title,authors,categories,thumbnail,description,published_year,average_rating,num_pages,ratings_count,short_description,title_and_subtitle,tagged_description
442,9780067575208,006757520X,The Sense of Wonder,Rachel Carson,Nature,http://books.google.com/books/content?id=Zee5S...,"First published more than three decades ago, t...",1998.0,4.39,112.0,1160.0,False,The Sense of Wonder,9780067575208 First published more than three ...
3214,9780689861130,0689861133,"Moo, Baa, la la La!",Sandra Boynton,Animal sounds,http://books.google.com/books/content?id=Gz40A...,Children will love joining in and imitating th...,2004.0,4.2,14.0,28261.0,False,"Moo, Baa, la la La!",9780689861130 Children will love joining in an...
3581,9780763620875,0763620874,Judy Moody Saves the World!,Megan McDonald,Juvenile Fiction,http://books.google.com/books/content?id=xDIRB...,When Judy Moody gets serious about protecting ...,2004.0,4.03,160.0,5883.0,False,Judy Moody Saves the World!,9780763620875 When Judy Moody gets serious abo...
3747,9780786808069,0786808063,Baby Einstein: Neighborhood Animals,Marilyn Singer;Julie Aigner-Clark,Juvenile Fiction,http://books.google.com/books/content?id=X9a4P...,Children will discover the exciting world of t...,2001.0,3.89,16.0,180.0,False,Baby Einstein: Neighborhood Animals,9780786808069 Children will discover the excit...
3748,9780786808373,0786808373,Baby Einstein: Birds,Julie Aigner-Clark,Juvenile Fiction,http://books.google.com/books/content?id=0jxHP...,"Introducing your baby to birds, cats, dogs, an...",2002.0,3.78,20.0,9.0,False,Baby Einstein: Birds,"9780786808373 Introducing your baby to birds, ..."
3749,9780786808380,0786808381,Baby Einstein: Babies,Julie Aigner-Clark,Juvenile Fiction,http://books.google.com/books/content?id=jv4NA...,"Introduce your babies to birds, cats, dogs, an...",2002.0,4.03,20.0,29.0,False,Baby Einstein: Babies,"9780786808380 Introduce your babies to birds, ..."
3750,9780786808397,078680839X,Baby Einstein: Dogs,Julie Aigner-Clark,Juvenile Fiction,http://books.google.com/books/content?id=qut8t...,"Introduce your baby to birds, cats, dogs, and ...",2002.0,3.81,20.0,26.0,False,Baby Einstein: Dogs,"9780786808397 Introduce your baby to birds, ca..."
3751,9780786808717,0786808713,Baby Einstein: What Does Violet See? Raindrops...,Julie Aigner-Clark,Juvenile Fiction,http://books.google.com/books/content?id=95IIA...,A very special puddle sets Violet the mouse of...,2002.0,3.25,18.0,16.0,False,Baby Einstein: What Does Violet See? Raindrops...,9780786808717 A very special puddle sets Viole...
3760,9780786812912,0786812915,The Big Box,Toni Morrison;Slade Morrison,Juvenile Fiction,http://books.google.com/books/content?id=LyYKA...,"In her first illustrated book for children, th...",2002.0,3.95,48.0,375.0,False,The Big Box,9780786812912 In her first illustrated book fo...
3797,9780789458209,0789458209,Tree,David Burnie,Juvenile Nonfiction,http://books.google.com/books/content?id=Qwsqj...,Photographs and text explore the anatomy and l...,2000.0,4.07,64.0,5.0,False,Tree,9780789458209 Photographs and text explore the...
