# 🧭 **Introduction to Vector Search for the Ramcharitmanas**

In this section, we use **LangChain** to implement vector search over the **Ramcharitmanas text**. The goal is a search system that returns shlokas or padyas whose meanings are semantically similar to a query. To do this, we rely on several packages from the **LangChain ecosystem**.

### 🛠️ **Dependencies and Workflow**

We're importing several key components that guide the process of building a vector database. Here's a brief overview of how each dependency fits into our workflow:

1. **TextLoader (from `langchain_community.document_loaders`)**:
   - **Role**: This class loads raw text into a format that LangChain can process. For our padya finder, it loads the meanings of the padyas as LangChain documents for further processing.
   
2. **CharacterTextSplitter (from `langchain_text_splitters`)**:
   - **Role**: After loading the raw padya meanings, the `CharacterTextSplitter` class splits long documents into smaller, manageable chunks. In our case, each chunk will represent a single padya meaning; in other contexts, this splitter can divide documents by character count around a chosen separator.
   
3. **OpenAIEmbeddings (from `langchain_openai`)**:
   - **Role**: The `OpenAIEmbeddings` class converts each chunk of text (a padya meaning) into a vector embedding: a numerical representation that captures the semantic meaning of the text. Using **OpenAI's embedding models**, the resulting vectors can be compared with one another for similarity.
   
4. **Chroma (from `langchain_chroma`)**:
   - **Role**: Once we have the embeddings, we need to store them for efficient retrieval. **Chroma** is a widely used open-source vector database that stores and manages the embeddings and supports efficient similarity search, which is the core of our padya finder. LangChain integrates with Chroma and other vector databases, so the storage backend can be swapped later.

### 🔄 **Workflow Overview**
1. **Text Loading**: Load the raw meanings using the **TextLoader**.
2. **Text Splitting**: Break down the text into smaller chunks using the **CharacterTextSplitter**.
3. **Embedding Generation**: Convert each chunk into vector embeddings using **OpenAIEmbeddings**.
4. **Vector Storage**: Store the embeddings in a vector database (we will use **Chroma**) for fast similarity search.
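The four steps can be sketched end to end without any external service. The following is a toy illustration, not the pipeline used in this notebook: `toy_embed` is a hypothetical bag-of-words stand-in for `OpenAIEmbeddings`, a plain Python list stands in for Chroma, and the English sentences are made-up sample data.

```python
import math
from collections import Counter

# 1. "Load": raw text with one tagged meaning per line (toy English stand-ins).
raw_text = "0 hanuman meets sita in lanka\n1 rama breaks the bow\n2 sita garlands rama\n"

# 2. "Split": one chunk per non-empty line.
chunks = [line for line in raw_text.split("\n") if line]

# 3. "Embed": a hypothetical bag-of-words vector, standing in for a real embedding model.
def toy_embed(text: str) -> Counter:
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# 4. "Store": a list of (chunk, vector) pairs, standing in for a vector database.
store = [(chunk, toy_embed(chunk)) for chunk in chunks]

def toy_search(query: str, k: int = 1) -> list[str]:
    """Return the k stored chunks most similar to the query."""
    query_vec = toy_embed(query)
    ranked = sorted(store, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]

print(toy_search("hanuman meets sita"))
```

A real embedding model replaces word overlap with learned semantic similarity, which is what lets a query match a meaning that shares no words with it.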

In [1]:
from langchain_community.document_loaders import TextLoader
from langchain_text_splitters import CharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_chroma import Chroma

When working with sensitive information like **API keys**, it's important to keep them secure and not hard-code them into your codebase. The best practice is to store such information in **environment variables**, which can be read when needed. One convenient way to manage them in Python is the `python-dotenv` package, imported as `dotenv`.

`dotenv` lets you store environment variables in a `.env` file and load them into your Python process with `load_dotenv()`, which returns `True` when it finds and loads at least one variable. This keeps sensitive values out of the code and makes them easy to configure per machine.

In [2]:
from dotenv import load_dotenv
load_dotenv()

True
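A quick check after `load_dotenv()` can catch a missing key before any API call is made. The sketch below uses a hypothetical helper name, `require_env`, and a dummy variable so that it runs anywhere; only `os.environ` is standard.

```python
import os

def require_env(name: str) -> str:
    """Return the value of an environment variable, or fail with a clear message."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"{name} is not set; add it to your .env file or shell environment")
    return value

# Example with a dummy variable so the snippet does not depend on a real key.
os.environ["DEMO_API_KEY"] = "dummy-value"
print(require_env("DEMO_API_KEY"))
```

In this notebook the same helper could be called as `require_env("OPENAI_API_KEY")` before creating the embedding model.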

In [3]:
import os
import pandas as pd
from tqdm import tqdm  # For progress bar

---

## **Reading the clean Ramcharitmanas dataset 📥**

In [4]:
padyas = pd.read_csv('datasets/ramcharitmanas_df.csv')
padyas

Unnamed: 0,Kand,Verse,Meaning,Verse Type,Verse Count,Page Number,id,tagged_meaning
0,‡§¨‡§æ‡§≤‡§ï‡§æ‡§£‡•ç‡§°,‡§µ‡§∞‡•ç‡§£‡§æ‡§®‡§æ‡§Æ‡§∞‡•ç‡§•‡§∏‡§Ç‡§ò‡§æ‡§®‡§æ‡§Ç ‡§∞‡§∏‡§æ‡§®‡§æ‡§Ç ‡§õ‡§®‡•ç‡§¶‡§∏‡§æ‡§Æ‡§™‡§ø‡•§\n‡§Æ‡§°‡§º‡§≤‡§æ‡§®‡§æ‡§Ç...,"‡§Ö‡§ï‡•ç‡§∑‡§∞‡•ã‡§Ç, ‡§Ö‡§∞‡•ç‡§•‡§∏‡§Æ‡•Ç‡§π‡•ã‡§Ç, ‡§∞‡§∏‡•ã‡§Ç, ‡§õ‡§®‡•ç‚Äç‡•ç‡§¶‡•ã‡§Ç ‡§î‡§∞ ‡§Æ‡§Ç‡§ó‡§≤‡•ã‡§Ç ...",‡§∂‡•ç‡§≤‡•ã‡§ï,‡•ß,‡•ß‡•≠,0,"0 ‡§Ö‡§ï‡•ç‡§∑‡§∞‡•ã‡§Ç, ‡§Ö‡§∞‡•ç‡§•‡§∏‡§Æ‡•Ç‡§π‡•ã‡§Ç, ‡§∞‡§∏‡•ã‡§Ç, ‡§õ‡§®‡•ç‚Äç‡•ç‡§¶‡•ã‡§Ç ‡§î‡§∞ ‡§Æ‡§Ç‡§ó‡§≤‡•ã..."
1,‡§¨‡§æ‡§≤‡§ï‡§æ‡§£‡•ç‡§°,‡§≠‡§µ‡§æ‡§®‡•Ä‡§∂‡§°‡•ç‡•Ç‡§∞‡•ã‡•å ‡§µ‡§®‡•ç‡§¶‡•á ‡§∂‡•ç‡§∞‡§¶‡•ç‡§ß‡§æ‡§µ‡§ø‡§∂‡•ç‡§µ‡§æ‡§∏‡§∞‡•Ç‡§™‡§ø‡§£‡•å‡•§\n‡§Ø‡§æ...,‡§∂‡•ç‡§∞‡§¶‡•ç‡§ß‡§æ ‡§î‡§∞ ‡§µ‡§ø‡§∂‡•ç‡§µ‡§æ‡§∏ ‡§ï‡•á ‡§∏‡•ç‡§µ‡§∞‡•Ç‡§™ ‡§∂‡•ç‡§∞‡•Ä‡§™‡§æ‡§∞‡•ç‡§µ‡§§‡•Ä‡§ú‡•Ä ‡§î‡§∞ ...,‡§∂‡•ç‡§≤‡•ã‡§ï,‡•®,‡•ß‡•≠,1,1 ‡§∂‡•ç‡§∞‡§¶‡•ç‡§ß‡§æ ‡§î‡§∞ ‡§µ‡§ø‡§∂‡•ç‡§µ‡§æ‡§∏‡§ï‡•á ‡§∏‡•ç‡§µ‡§∞‡•Ç‡§™ ‡§∂‡•ç‡§∞‡•Ä‡§™‡§æ‡§∞‡•ç‡§µ‡§§‡•Ä‡§ú‡•Ä ‡§î‡§∞...
2,‡§¨‡§æ‡§≤‡§ï‡§æ‡§£‡•ç‡§°,‡§µ‡§®‡•ç‡§¶‡•á ‡§¨‡•ã‡§ß‡§Æ‡§Ø‡§Ç ‡§®‡§ø‡§§‡•ç‡§Ø‡§Ç ‡§ó‡•Å‡§∞‡•Å ‡§∂‡§°‡•ç‡§°‡•Ç‡§∞‡§∞‡•Ç‡§™‡§ø‡§£‡§Æ‡•ç‚Äå‡•§\n‡§Ø‡§Æ‡§æ‡§•...,"‡§ú‡•ç‡§û‡§æ‡§®‡§Æ‡§Ø, ‡§®‡§ø‡§§‡•ç‡§Ø, ‡§∂‡§°‡•ç‡§°‡•Ç‡§∞‡§∞‡•Ç‡§™‡•Ä ‡§ó‡•Å‡§∞‡•Å ‡§ï‡•Ä ‡§Æ‡•à‡§Ç ‡§µ‡§®‡•ç‡§¶‡§®‡§æ ...",‡§∂‡•ç‡§≤‡•ã‡§ï,‡•©,‡•ß‡•≠,2,"2 ‡§ú‡•ç‡§û‡§æ‡§®‡§Æ‡§Ø, ‡§®‡§ø‡§§‡•ç‡§Ø, ‡§∂‡§°‡•ç‡§°‡•Ç‡§∞‡§∞‡•Ç‡§™‡•Ä ‡§ó‡•Å‡§∞‡•Å‡§ï‡•Ä ‡§Æ‡•à‡§Ç ‡§µ‡§®‡•ç‡§¶‡§®‡§æ..."
3,‡§¨‡§æ‡§≤‡§ï‡§æ‡§£‡•ç‡§°,‡§∏‡•Ä‡§§‡§æ‡§∞‡§æ‡§Æ‡§ó‡•Å‡§£‡§ó‡•ç‡§∞‡§æ‡§Æ‡§™‡•Å‡§£‡•ç‡§Ø‡§æ‡§∞‡§£‡•ç‡§Ø‡§µ‡§ø‡§π‡§æ‡§∞‡§ø‡§£‡•å‡•§\n‡§µ‡§®‡•ç‡§¶‡•á ‡§µ‡§ø‡§∂‡•Å...,‡§∂‡•ç‡§∞‡•Ä‡§∏‡•Ä‡§§‡§æ‡§∞‡§æ‡§Æ‡§ú‡•Ä ‡§ï‡•á ‡§ó‡•Å‡§£‡§∏‡§Æ‡•Ç‡§π‡§∞‡•Ç‡§™‡•Ä ‡§™‡§µ‡§ø‡§§‡•ç‡§∞ ‡§µ‡§® ‡§Æ‡•á‡§Ç ‡§µ‡§ø‡§π...,‡§∂‡•ç‡§≤‡•ã‡§ï,‡•™,‡•ß‡•≠,3,3 ‡§∂‡•ç‡§∞‡•Ä‡§∏‡•Ä‡§§‡§æ‡§∞‡§æ‡§Æ‡§ú‡•Ä‡§ï‡•á ‡§ó‡•Å‡§£‡§∏‡§Æ‡•Ç‡§π‡§∞‡•Ç‡§™‡•Ä ‡§™‡§µ‡§ø‡§§‡•ç‡§∞ ‡§µ‡§®‡§Æ‡•á‡§Ç ‡§µ‡§ø‡§π...
4,‡§¨‡§æ‡§≤‡§ï‡§æ‡§£‡•ç‡§°,‡§â‡§¶‡•ç‡§ß‡§µ‡§∏‡•ç‡§•‡§ø‡§§‡§ø‡§∏‡§Ç‡§π‡§æ‡§∞‡§ï‡§æ‡§∞‡§ø‡§£‡•Ä‡§Ç ‡§ï‡•ç‡§≤‡•á‡§∂‡§π‡§æ‡§∞‡§ø‡§£‡•Ä‡§Æ‡•ç‚Äå‡•§\n‡§∏‡§∞‡•ç‡§µ‡§∂...,"‡§â‡§§‡•ç‡§™‡§§‡•ç‡§§‡§ø, ‡§∏‡•ç‡§•‡§ø‡§§‡§ø (‡§™‡§æ‡§≤‡§®) ‡§î‡§∞ ‡§∏‡§Ç‡§π‡§æ‡§∞ ‡§ï‡§∞‡§®‡•á ‡§µ‡§æ‡§≤‡•Ä, ‡§ï‡•ç...",‡§∂‡•ç‡§≤‡•ã‡§ï,‡•´,‡•ß‡•Æ,4,"4 ‡§â‡§§‡•ç‡§™‡§§‡•ç‡§§‡§ø, ‡§∏‡•ç‡§•‡§ø‡§§‡§ø (‡§™‡§æ‡§≤‡§®) ‡§î‡§∞ ‡§∏‡§Ç‡§π‡§æ‡§∞ ‡§ï‡§∞‡§®‡•á‡§µ‡§æ‡§≤‡•Ä, ‡§ï..."
...,...,...,...,...,...,...,...,...
6157,‡§â‡§§‡•ç‡§§‡§∞‡§ï‡§æ‡§£‡•ç‡§°,‡§Æ‡•ã ‡§∏‡§Æ ‡§¶‡•Ä‡§® ‡§® ‡§¶‡•Ä‡§® ‡§π‡§ø‡§§ ‡§§‡•Å‡§Æ‡•ç‡§π ‡§∏‡§Æ‡§æ‡§® ‡§∞‡§ò‡•Å‡§¨‡•Ä‡§∞‡•§\n‡§Ö‡§∏ ‡§¨‡§ø‡§ö...,‡§π‡•á ‡§∂‡•ç‡§∞‡•Ä‡§∞‡§ò‡•Å‡§µ‡•Ä‡§∞ ! ‡§Æ‡•á‡§∞‡•á ‡§∏‡§Æ‡§æ‡§® ‡§ï‡•ã‡§à ‡§¶‡•Ä‡§® ‡§®‡§π‡•Ä‡§Ç ‡§π‡•à ‡§î‡§∞ ‡§Ü...,‡§¶‡•ã‡•¶,‡•ß‡•©‡•¶ (‡§ï),‡•ß‡•¶‡•™‡•¨,6157,6157 ‡§π‡•á ‡§∂‡•ç‡§∞‡•Ä‡§∞‡§ò‡•Å‡§µ‡•Ä‡§∞ ! ‡§Æ‡•á‡§∞‡•á ‡§∏‡§Æ‡§æ‡§® ‡§ï‡•ã‡§à ‡§¶‡•Ä‡§® ‡§®‡§π‡•Ä‡§Ç ‡§π‡•à...
6158,‡§â‡§§‡•ç‡§§‡§∞‡§ï‡§æ‡§£‡•ç‡§°,‡§ï‡§æ‡§Æ‡§ø‡§π‡§ø ‡§®‡§æ‡§∞‡§ø ‡§™‡§ø‡§Ü‡§∞‡§ø ‡§ú‡§ø‡§Æ‡§ø ‡§≤‡•ã‡§≠‡§ø‡§π‡§ø ‡§™‡•ç‡§∞‡§ø‡§Ø ‡§ú‡§ø‡§Æ‡§ø ‡§¶‡§æ‡§Æ‡•§\...,‡§ú‡•à ‡§∏‡•á ‡§ï‡§æ‡§Æ‡•Ä ‡§ï‡•ã ‡§∏‡•ç‡§§‡•ç‡§∞‡•Ä ‡§™‡•ç‡§∞‡§ø‡§Ø ‡§≤‡§ó‡§§‡•Ä ‡§π‡•à ‡§î‡§∞ ‡§≤‡•ã‡§≠‡•Ä ‡§ï‡•ã ...,‡§¶‡•ã‡•¶,‡•ß‡•©‡•¶ (‡§ñ),‡•ß‡•¶‡•™‡•¨,6158,6158 ‡§ú‡•à‡§∏‡•á ‡§ï‡§æ‡§Æ‡•Ä‡§ï‡•ã ‡§∏‡•ç‡§§‡•ç‡§∞‡•Ä ‡§™‡•ç‡§∞‡§ø‡§Ø ‡§≤‡§ó‡§§‡•Ä ‡§π‡•à ‡§î‡§∞ ‡§≤‡•ã‡§≠‡•Ä‡§ï...
6159,‡§â‡§§‡•ç‡§§‡§∞‡§ï‡§æ‡§£‡•ç‡§°,‡§Ø ‡§§‡•ç‡§™‡•Ç‡§∞‡•ç‡§µ ‡§™‡•ç‡§∞‡§≠‡•Å‡§£‡§æ ‡§ï‡•É‡§§‡§Ç ‡§∏‡•Å‡§ï‡§µ‡§ø‡§®‡§æ ‡§∂‡•ç‡§∞‡•Ä‡§∂‡§Æ‡•ç‡§≠‡•Å‡§®‡§æ ‡§¶‡•Å‡§∞...,‡§∂‡•ç‡§∞‡•á‡§∑‡•ç‡§† ‡§ï‡§µ‡§ø ‡§≠‡§ó‡§µ‡§æ‡§®‡•ç‚Äå ‡§∂‡•ç‡§∞‡•Ä‡§∂‡§Ç‡§ï‡§∞‡§ú‡•Ä‡§®‡•á ‡§™‡§π‡§≤‡•á ‡§ú‡§ø‡§∏ ‡§¶‡•Å‡§∞‡•ç...,‡§∂‡•ç‡§≤‡•ã‡§ï,‡•ß,‡•ß‡•¶‡•™‡•¨,6159,6159 ‡§∂‡•ç‡§∞‡•á‡§∑‡•ç‡§† ‡§ï‡§µ‡§ø ‡§≠‡§ó‡§µ‡§æ‡§®‡•ç‚Äå ‡§∂‡•ç‡§∞‡•Ä‡§∂‡§Ç‡§ï‡§∞‡§ú‡•Ä‡§®‡•á ‡§™‡§π‡§≤‡•á ‡§ú‡§ø‡§∏...
6160,‡§â‡§§‡•ç‡§§‡§∞‡§ï‡§æ‡§£‡•ç‡§°,‡§™‡•Å‡§£‡•ç‡§Ø‡§Ç ‡§™‡§æ‡§™‡§π‡§∞‡§Ç ‡§∏‡§¶‡§æ ‡§∂‡§ø‡§µ‡§ï‡§∞‡§Ç ‡§µ‡§ø‡§ú‡•ç‡§û‡§æ‡§®‡§≠‡§ï‡•ç‡§§‡§ø‡§™‡•ç‡§∞‡§¶‡§Ç‡•§\n‡§Æ...,"‡§Ø‡§π ‡§∂‡•ç‡§∞‡•Ä‡§∞‡§æ‡§Æ‡§ö‡§∞‡§ø‡§§‡§Æ‡§æ‡§®‡§∏ ‡§™‡•Å‡§£‡•ç‡§Ø‡§∞‡•Ç‡§™, ‡§™‡§æ‡§™‡•ã‡§Ç ‡§ï‡§æ ‡§π‡§∞‡§£ ‡§ï‡§∞‡§®‡•á...",‡§∂‡•ç‡§≤‡•ã‡§ï,‡•®,‡•ß‡•¶‡•™‡•¨,6160,"6160 ‡§Ø‡§π ‡§∂‡•ç‡§∞‡•Ä‡§∞‡§æ‡§Æ‡§ö‡§∞‡§ø‡§§‡§Æ‡§æ‡§®‡§∏ ‡§™‡•Å‡§£‡•ç‡§Ø‡§∞‡•Ç‡§™, ‡§™‡§æ‡§™‡•ã‡§Ç‡§ï‡§æ ‡§π‡§∞‡§£ ..."


----

## **Cleaning the verse meanings**

In [5]:
# Hindi postpositions that the source text often fuses onto the preceding word
postpositions = [
    'का', 'की', 'के', 'में', 'से', 'पर', 'को', 'तक', 'बिना', 'साथ', 'वाले', 'वाली', 'वाला'
]

# Insert a space before a fused postposition: 'root+postposition' -> 'root postposition'
for idx in range(len(padyas['Meaning'])):
    meaning = padyas.loc[idx, 'Meaning']
    # Skip very short tokens, which are usually standalone postpositions
    tokens = [word for word in meaning.split(' ') if len(word) >= 3]
    for token in tokens:
        for post in postpositions:
            if token.endswith(post) and len(token) > len(post):
                root = token[:-len(post)]
                # Note: str.replace rewrites every occurrence of the token in the meaning
                padyas.loc[idx, 'Meaning'] = padyas.loc[idx, 'Meaning'].replace(token, root + ' ' + post)
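The same heuristic can be written as a small function that works token by token, which avoids touching other parts of the string that happen to contain the same characters. This is a sketch under assumptions: `split_postpositions` is a hypothetical name, it tries longer suffixes first, and the sample sentence is made up. Like the loop above, it is a crude rule and will also split ordinary words that merely end in one of these syllables.

```python
POSTPOSITIONS = [
    'का', 'की', 'के', 'में', 'से', 'पर', 'को', 'तक', 'बिना', 'साथ', 'वाले', 'वाली', 'वाला'
]
# Try longer suffixes first so that the longest fused postposition wins.
_BY_LENGTH = sorted(POSTPOSITIONS, key=len, reverse=True)

def split_postpositions(meaning: str) -> str:
    """Insert a space before a postposition fused onto the end of a token."""
    out = []
    for token in meaning.split(' '):
        # Tokens shorter than 3 characters are left alone, as in the loop above.
        if len(token) >= 3:
            for post in _BY_LENGTH:
                if token.endswith(post) and len(token) > len(post):
                    token = token[:-len(post)] + ' ' + post
                    break
        out.append(token)
    return ' '.join(out)

print(split_postpositions('रामको वनमें देखा'))
```

With pandas, such a function could be applied in one step with `padyas['Meaning'].apply(split_postpositions)`.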

In [6]:
padyas.loc[0, 'Meaning']

'‡§Ö‡§ï‡•ç‡§∑‡§∞‡•ã‡§Ç, ‡§Ö‡§∞‡•ç‡§•‡§∏‡§Æ‡•Ç‡§π‡•ã‡§Ç, ‡§∞‡§∏‡•ã‡§Ç, ‡§õ‡§®‡•ç\u200d‡•ç‡§¶‡•ã‡§Ç ‡§î‡§∞ ‡§Æ‡§Ç‡§ó‡§≤‡•ã‡§Ç ‡§ï‡•Ä ‡§ï‡§∞‡§®‡•á ‡§µ‡§æ‡§≤‡•Ä ‡§∏‡§∞‡§∏‡•ç‡§µ‡§§‡•Ä‡§ú‡•Ä ‡§î‡§∞ ‡§ó‡§£‡•á‡§∂‡§ú‡•Ä ‡§ï‡•Ä ‡§Æ‡•à‡§Ç ‡§µ‡§®‡•ç‡§¶‡§®‡§æ ‡§ï‡§∞‡§§‡§æ ‡§π‡•Ç‡§Å‡••'

---

## **Tagging Meanings with IDs for Seamless Filtering 📚🔍**

Tagging each padya with an **ID** gives it a unique, reliable identifier, which streamlines filtering and lookup. Rather than relying on costly and error-prone string matching, IDs offer a direct way to pinpoint specific padyas.

By using ID tags, we can:
- **Eliminate the need for string matching**, improving lookup speed and accuracy.
- **Ensure precision** in retrieving exact matches, avoiding ambiguity between similar meanings.
- **Join back easily** to the `padyas` DataFrame to recover the Verse, Kand, and Verse Type of each result.

Incorporating IDs keeps the lookup step fast and unambiguous as the padya finder grows.
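The tagging scheme is simply "ID, one space, meaning", so it can be reversed without any string matching on the meaning itself. A minimal sketch with hypothetical helper names (`tag_meaning`, `parse_tagged`) and placeholder text:

```python
def tag_meaning(padya_id: int, meaning: str) -> str:
    """Prefix a meaning with its ID, separated by a single space."""
    return f"{padya_id} {meaning}"

def parse_tagged(tagged: str) -> tuple[int, str]:
    """Split a tagged meaning back into (id, meaning)."""
    id_part, _, meaning = tagged.partition(' ')
    return int(id_part), meaning

tagged = tag_meaning(2482, 'meaning text goes here')
print(parse_tagged(tagged))
```

Only the leading token is parsed, so digits that appear later in a meaning do not interfere with the lookup.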

In [7]:
padyas['id'] = list(range(0, len(padyas)))

In [8]:
padyas.to_csv('datasets/ramcharitmanas_df.csv', index=False)

In [9]:
padyas['tagged_meaning'] = padyas['id'].astype(str) + ' ' + padyas['Meaning']

In [10]:
padyas['tagged_meaning'].head()

0    0 ‡§Ö‡§ï‡•ç‡§∑‡§∞‡•ã‡§Ç, ‡§Ö‡§∞‡•ç‡§•‡§∏‡§Æ‡•Ç‡§π‡•ã‡§Ç, ‡§∞‡§∏‡•ã‡§Ç, ‡§õ‡§®‡•ç‚Äç‡•ç‡§¶‡•ã‡§Ç ‡§î‡§∞ ‡§Æ‡§Ç‡§ó‡§≤‡•ã...
1    1 ‡§∂‡•ç‡§∞‡§¶‡•ç‡§ß‡§æ ‡§î‡§∞ ‡§µ‡§ø‡§∂‡•ç‡§µ‡§æ‡§∏ ‡§ï‡•á ‡§∏‡•ç‡§µ‡§∞‡•Ç‡§™ ‡§∂‡•ç‡§∞‡•Ä‡§™‡§æ‡§∞‡•ç‡§µ‡§§‡•Ä‡§ú‡•Ä ‡§î...
2    2 ‡§ú‡•ç‡§û‡§æ‡§®‡§Æ‡§Ø, ‡§®‡§ø‡§§‡•ç‡§Ø, ‡§∂‡§°‡•ç‡§°‡•Ç‡§∞‡§∞‡•Ç‡§™‡•Ä ‡§ó‡•Å‡§∞‡•Å ‡§ï‡•Ä ‡§Æ‡•à‡§Ç ‡§µ‡§®‡•ç‡§¶‡§®...
3    3 ‡§∂‡•ç‡§∞‡•Ä‡§∏‡•Ä‡§§‡§æ‡§∞‡§æ‡§Æ‡§ú‡•Ä ‡§ï‡•á ‡§ó‡•Å‡§£‡§∏‡§Æ‡•Ç‡§π‡§∞‡•Ç‡§™‡•Ä ‡§™‡§µ‡§ø‡§§‡•ç‡§∞ ‡§µ‡§® ‡§Æ‡•á‡§Ç ‡§µ...
4    4 ‡§â‡§§‡•ç‡§™‡§§‡•ç‡§§‡§ø, ‡§∏‡•ç‡§•‡§ø‡§§‡§ø (‡§™‡§æ‡§≤‡§®) ‡§î‡§∞ ‡§∏‡§Ç‡§π‡§æ‡§∞ ‡§ï‡§∞‡§®‡•á ‡§µ‡§æ‡§≤‡•Ä, ...
Name: tagged_meaning, dtype: object

In [11]:
# Replace embedded newlines so that each meaning fits on a single line of the text file
padyas['Meaning'] = padyas['Meaning'].str.replace('\n', ' ', regex=False)
padyas['tagged_meaning'] = padyas['tagged_meaning'].str.replace('\n', ' ', regex=False)

In [12]:
all_chs = []
for _, row in padyas.iterrows():
    for ch in row['tagged_meaning']:
        all_chs.append(ch)
print(sorted(list(set(all_chs))))

[' ', '!', '"', "'", '(', ')', '*', ',', '-', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', ';', '?', '[', ']', '‡§Å', '‡§Ç', '‡§É', '‡§Ö', '‡§Ü', '‡§á', '‡§à', '‡§â', '‡§ä', '‡§ã', '‡§è', '‡§ê', '‡§ì', '‡§î', '‡§ï', '‡§ñ', '‡§ó', '‡§ò', '‡§ö', '‡§õ', '‡§ú', '‡§ù', '‡§û', '‡§ü', '‡§†', '‡§°', '‡§¢', '‡§£', '‡§§', '‡§•', '‡§¶', '‡§ß', '‡§®', '‡§™', '‡§´', '‡§¨', '‡§≠', '‡§Æ', '‡§Ø', '‡§∞', '‡§≤', '‡§µ', '‡§∂', '‡§∑', '‡§∏', '‡§π', '‡§º', '‡§Ω', '‡§æ', '‡§ø', '‡•Ä', '‡•Å', '‡•Ç', '‡•É', '‡•á', '‡•à', '‡•â', '‡•ã', '‡•å', '‡•ç', '‡•ê', '‡•ë', '‡•í', '‡•§', '‡••', '‡•¶', '‡•ß', '‡•®', '‡•©', '‡•™', '‡•´', '‡•¨', '\u200c', '\u200d', '‚Äú', '‚Äù']


---

## **Saving Tagged Meanings to a Text File for Use with TextLoader 📝**

The **TextLoader** class from LangChain is designed to work with raw text files, not directly with a **Pandas DataFrame**. Since the padya meanings are stored in the DataFrame, we need to extract and save them in a compatible format.

To bridge this gap, we need to:
1. **Extract tagged meanings** (with IDs) from the DataFrame.
2. **Save them to a text file**, with each meaning on its own line.

This ensures that the **TextLoader** can read the meanings and hand them to the rest of the LangChain pipeline. Let's save our tagged meanings and get them ready for the next steps in building the vector database.

In [13]:
tagged_meanings = list(padyas['tagged_meaning'])
tagged_meanings = [(meaning + '\n') for meaning in tagged_meanings]
tagged_meanings[:5]

['0 ‡§Ö‡§ï‡•ç‡§∑‡§∞‡•ã‡§Ç, ‡§Ö‡§∞‡•ç‡§•‡§∏‡§Æ‡•Ç‡§π‡•ã‡§Ç, ‡§∞‡§∏‡•ã‡§Ç, ‡§õ‡§®‡•ç\u200d‡•ç‡§¶‡•ã‡§Ç ‡§î‡§∞ ‡§Æ‡§Ç‡§ó‡§≤‡•ã‡§Ç ‡§ï‡•Ä ‡§ï‡§∞‡§®‡•á ‡§µ‡§æ‡§≤‡•Ä ‡§∏‡§∞‡§∏‡•ç‡§µ‡§§‡•Ä‡§ú‡•Ä ‡§î‡§∞ ‡§ó‡§£‡•á‡§∂‡§ú‡•Ä ‡§ï‡•Ä ‡§Æ‡•à‡§Ç ‡§µ‡§®‡•ç‡§¶‡§®‡§æ ‡§ï‡§∞‡§§‡§æ ‡§π‡•Ç‡§Å‡••\n',
 '1 ‡§∂‡•ç‡§∞‡§¶‡•ç‡§ß‡§æ ‡§î‡§∞ ‡§µ‡§ø‡§∂‡•ç‡§µ‡§æ‡§∏ ‡§ï‡•á ‡§∏‡•ç‡§µ‡§∞‡•Ç‡§™ ‡§∂‡•ç‡§∞‡•Ä‡§™‡§æ‡§∞‡•ç‡§µ‡§§‡•Ä‡§ú‡•Ä ‡§î‡§∞ ‡§∂‡•ç‡§∞‡•Ä‡§∂‡§°‡•ç‡§°‡§∞‡§ú‡•Ä ‡§ï‡•Ä ‡§Æ‡•à‡§Ç ‡§µ‡§®‡•ç‡§¶‡§®‡§æ ‡§ï‡§∞‡§§‡§æ ‡§π‡•Ç‡§Å, ‡§ú‡§ø‡§® ‡§ï‡•á ‡§¨‡§ø‡§®‡§æ ‡§∏‡§ø‡§¶‡•ç‡§ß‡§ú‡§® ‡§Ö‡§™‡§®‡•á ‡§Ö‡§®‡•ç‡§§‡§É‡§ï‡§∞‡§£ ‡§Æ‡•á‡§Ç ‡§∏‡•ç‡§•‡§ø‡§§ ‡§à‡§∂‡•ç‡§µ‡§∞ ‡§ï‡•ã ‡§®‡§π‡•Ä‡§Ç ‡§¶‡•á‡§ñ ‡§∏‡§ï‡§§‡•á‡••\n',
 '2 ‡§ú‡•ç‡§û‡§æ‡§®‡§Æ‡§Ø, ‡§®‡§ø‡§§‡•ç‡§Ø, ‡§∂‡§°‡•ç‡§°‡•Ç‡§∞‡§∞‡•Ç‡§™‡•Ä ‡§ó‡•Å‡§∞‡•Å ‡§ï‡•Ä ‡§Æ‡•à‡§Ç ‡§µ‡§®‡•ç‡§¶‡§®‡§æ ‡§ï‡§∞‡§§‡§æ ‡§π‡•Ç‡§Å, ‡§ú‡§ø‡§® ‡§ï‡•á ‡§Ü‡§∂‡•ç‡§∞‡§ø‡§§ ‡§π‡•ã‡§®‡•á ‡§∏‡•á ‡§π‡•Ä ‡§ü‡•á‡§¢‡§º‡§æ ‡§ö‡§®‡•ç‡§¶‡•ç‡§∞‡§Æ‡§æ ‡§≠‡•Ä ‡§∏‡§∞‡•ç‡§µ‡§§‡•ç‡§∞ ‡§µ‡§®‡•ç‡§¶‡§ø‡§§ ‡§π‡•ã‡§§‡§æ ‡§π

In [14]:
with open('datasets/tagged_meanings.txt', 'w', encoding='utf-8') as file:
    file.writelines(tagged_meanings)
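Because the later split relies on one meaning per line, it is worth confirming that the file has exactly as many lines as there are meanings. The following is a minimal sketch with toy data and a temporary file rather than `datasets/tagged_meanings.txt`; passing `encoding='utf-8'` explicitly is an assumption that guards against platform-dependent default encodings when writing Devanagari text.

```python
import os
import tempfile

toy_meanings = ['0 first meaning', '1 second meaning', '2 third meaning']

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'tagged_meanings.txt')

    # Write one meaning per line.
    with open(path, 'w', encoding='utf-8') as file:
        file.writelines(meaning + '\n' for meaning in toy_meanings)

    # Read the file back and compare it line by line with the input.
    with open(path, encoding='utf-8') as file:
        lines_read = file.read().splitlines()

print(len(lines_read) == len(toy_meanings))
```

If the counts differ, a meaning most likely still contains an embedded newline.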

In [15]:
raw_documents = TextLoader('datasets/tagged_meanings.txt', encoding='utf-8').load()

## **Splitting the Text into Chunks ✂️**

Now that we have our tagged meanings saved into a text file, the next step is **splitting** the text into manageable pieces using LangChain's `CharacterTextSplitter`.

In [16]:
text_splitter = CharacterTextSplitter(chunk_size=0, chunk_overlap=0, separator='\n')
documents = text_splitter.split_documents(raw_documents)

Created a chunk of size 104, which is longer than the specified 0
Created a chunk of size 151, which is longer than the specified 0
Created a chunk of size 124, which is longer than the specified 0
Created a chunk of size 155, which is longer than the specified 0
Created a chunk of size 165, which is longer than the specified 0
Created a chunk of size 348, which is longer than the specified 0
Created a chunk of size 210, which is longer than the specified 0
Created a chunk of size 170, which is longer than the specified 0
Created a chunk of size 188, which is longer than the specified 0
Created a chunk of size 167, which is longer than the specified 0
Created a chunk of size 188, which is longer than the specified 0
Created a chunk of size 180, which is longer than the specified 0
Created a chunk of size 216, which is longer than the specified 0
Created a chunk of size 225, which is longer than the specified 0
Created a chunk of size 234, which is longer than the specified 0
Created a 

In [17]:
len(documents)

6162

- **`separator='\n'`** tells the splitter to break the text **at each newline**, which fits because we saved each padya meaning on its own line.
- **`chunk_size=0`** means **do not group multiple meanings together**: no two pieces can be merged without exceeding a size of 0, so each line stays its own chunk. This is also why the splitter logs a "Created a chunk of size N, which is longer than the specified 0" warning for every line; the warnings are expected here.
- **`chunk_overlap=0`** means **no overlap** between chunks; every meaning stays cleanly separated without repeating parts.

This setup keeps each meaning neatly isolated and ready for embedding. 🚀
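For this file layout, the net effect of the splitter is close to splitting the text on newlines and dropping empty pieces. The snippet below is a simplified model of that behaviour for intuition only, not `CharacterTextSplitter`'s actual implementation; `split_one_per_line` is a hypothetical name and the sample text is made up.

```python
def split_one_per_line(text: str, separator: str = '\n') -> list[str]:
    """Return one stripped chunk per non-empty piece between separators."""
    return [piece.strip() for piece in text.split(separator) if piece.strip()]

sample = '0 first meaning\n1 second meaning\n2 third meaning\n'
chunks = split_one_per_line(sample)
print(len(chunks))
```

This is why the number of documents after splitting equals the number of lines written to the text file.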

In [18]:
documents[0]

Document(metadata={'source': 'datasets/tagged_meanings.txt'}, page_content='0 ‡§Ö‡§ï‡•ç‡§∑‡§∞‡•ã‡§Ç, ‡§Ö‡§∞‡•ç‡§•‡§∏‡§Æ‡•Ç‡§π‡•ã‡§Ç, ‡§∞‡§∏‡•ã‡§Ç, ‡§õ‡§®‡•ç\u200d‡•ç‡§¶‡•ã‡§Ç ‡§î‡§∞ ‡§Æ‡§Ç‡§ó‡§≤‡•ã‡§Ç ‡§ï‡•Ä ‡§ï‡§∞‡§®‡•á ‡§µ‡§æ‡§≤‡•Ä ‡§∏‡§∞‡§∏‡•ç‡§µ‡§§‡•Ä‡§ú‡•Ä ‡§î‡§∞ ‡§ó‡§£‡•á‡§∂‡§ú‡•Ä ‡§ï‡•Ä ‡§Æ‡•à‡§Ç ‡§µ‡§®‡•ç‡§¶‡§®‡§æ ‡§ï‡§∞‡§§‡§æ ‡§π‡•Ç‡§Å‡••')

---

## 🧠 Building Our Vector Database with Chroma

Now that we have our padya meanings split and ready, the next step is to **embed** them and **store** them in a **vector database**.  
For this, we're using **Chroma**, a lightweight, open-source vector database that's very popular for working with document embeddings in LangChain.

Chroma allows us to efficiently store and search embeddings locally, with no need for an external database service, making it well suited to prototyping apps like our padya finder.

### 🔐 API Key Setup: Don't Forget the `.env` File

Since we're making API calls to OpenAI, we need to **authenticate** using our API key.

Create a `.env` file in your project root with the following content:

```
OPENAI_API_KEY=your_openai_api_key_here
```

And don't forget to **load the environment variables** (with `load_dotenv()`) before using `OpenAIEmbeddings`.

> 📦 `.env` keeps sensitive information out of your main codebase; remember to exclude the file from version control.





In [19]:
# Create the persistence directory if it does not exist yet
os.makedirs('datasets/chroma_db', exist_ok=True)

In [20]:
embedding_model = OpenAIEmbeddings(model='text-embedding-3-large')

db_padyas = Chroma(
    collection_name='padya_meanings',
    embedding_function=embedding_model,
    persist_directory='datasets/chroma_db',
)

batch_size = 50

# Embed and store the documents in small batches to stay under API rate limits
for i in tqdm(range(0, len(documents), batch_size)):
    batch = documents[i: i+batch_size]
    db_padyas.add_documents(batch)

100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 124/124 [04:50<00:00,  2.34s/it]


In [21]:
db_padyas._collection.count() 

6162

### 🚨 Why Do We Need Batching?

When using models like OpenAI's `text-embedding-3-large`, **there are rate limits** on how many requests and tokens you can send per minute.  
If we try to embed a large number of documents at once, we can hit a **RateLimitError**.

> ✅ **Batching** helps us process smaller, manageable chunks of data without exceeding API rate limits, making the upload safe, stable, and reliable.
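Batching pairs naturally with a retry that waits longer after each failure. The sketch below is one possible shape, not the notebook's method: `iter_batches` and `add_with_retry` are hypothetical helpers, the delays are illustrative, and the broad `except Exception` is a placeholder for the client's specific rate-limit error.

```python
import time
from typing import Callable, Iterator, Sequence

def iter_batches(items: Sequence, batch_size: int) -> Iterator[Sequence]:
    """Yield consecutive slices of at most `batch_size` items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def add_with_retry(
    add_fn: Callable[[Sequence], object],
    batch: Sequence,
    max_attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Call `add_fn(batch)`, doubling the wait after each failed attempt."""
    for attempt in range(max_attempts):
        try:
            add_fn(batch)
            return
        except Exception:  # in practice, catch the client's rate-limit error specifically
            if attempt == max_attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)

# 6162 documents in batches of 50 gives 124 batches, matching the progress bar above.
print(len(list(iter_batches(range(6162), 50))))
```

In the loop above, `db_padyas.add_documents(batch)` could then be replaced by `add_with_retry(db_padyas.add_documents, batch)`.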



### 💾 How Chroma Persists Data Locally

When you use Chroma with a `persist_directory` (like `datasets/chroma_db`), it saves your vector database **on disk** automatically.



In [22]:
root_dir = 'datasets/chroma_db/'

# Create a list to store file info
file_data = []

# Walk through the directory
for foldername, subfolders, filenames in os.walk(root_dir):
    for filename in filenames:
        file_path = os.path.join(foldername, filename)
        size_bytes = os.path.getsize(file_path)
        file_data.append({
            'File Name': filename,
            'File Path': file_path,
            'Size (KB)': round(size_bytes / 1024, 2),
        })

# Convert to a pandas DataFrame for a clean table
df_files = pd.DataFrame(file_data)

df_files

Unnamed: 0,File Name,File Path,Size (KB)
0,chroma.sqlite3,datasets/chroma_db/chroma.sqlite3,28972.0
1,data_level0.bin,datasets/chroma_db/3ee9331c-32d8-4850-8fb8-9b6...,72820.31
2,length.bin,datasets/chroma_db/3ee9331c-32d8-4850-8fb8-9b6...,23.44
3,link_lists.bin,datasets/chroma_db/3ee9331c-32d8-4850-8fb8-9b6...,50.93
4,header.bin,datasets/chroma_db/3ee9331c-32d8-4850-8fb8-9b6...,0.1
5,index_metadata.pickle,datasets/chroma_db/3ee9331c-32d8-4850-8fb8-9b6...,337.92


#### 📦 Chroma DB Folder Contents

| File Name | Description |
|:---|:---|
| `chroma.sqlite3` | The main SQLite database that stores metadata about documents and embeddings. | 
| `data_level0.bin` | Binary file containing the raw vector embeddings stored efficiently for fast search. | 
| `length.bin` | Stores the lengths of each vector entry, used to efficiently index into `data_level0.bin`. | 
| `link_lists.bin` | Contains graph link information for ANN (Approximate Nearest Neighbors) search structure. | 
| `header.bin` | Small file that stores metadata about how the binary data is organized (dimensions, format). | 
| `index_metadata.pickle` | Pickle file that contains additional metadata about the vector index for Chroma's internal use. | 


---

## 📝 Reading the saved embeddings directly

Load the Chroma database from the `persist_directory` where it was saved, and initialize the same embedding model that was used to create the embeddings (here `OpenAIEmbeddings` with `text-embedding-3-large`), so that queries are embedded in the same vector space.

> **There is no need to rerun the previous section to create the embeddings in that case.**

In [23]:
embedding_model = OpenAIEmbeddings(model='text-embedding-3-large')

db_padyas = Chroma(
    collection_name='padya_meanings',
    embedding_function=embedding_model,
    persist_directory='datasets/chroma_db',
)

In [24]:
db_padyas._collection.count() 

6162

## 🔍 Querying Saved Embeddings

After saving embeddings into the Chroma database, you can query the stored data to find similar padyas or to search for content related to a specific topic. Below are the steps to query the Chroma database.

### 🔍 **Perform a Similarity Search**

Once the database is loaded, you can query it with a piece of text. The database returns the padyas whose stored embeddings are most similar to the embedding of the query. Here's how you can do a similarity search:

In [25]:
query = "Hanuman Ji meets Mata Sita"
docs = db_padyas.similarity_search(query=query, k=10)
docs

[Document(id='e3856e3b-fec9-4456-84bd-3bfcdc7e93c2', metadata={'source': 'datasets/tagged_meanings.txt'}, page_content='2482 ‡§â‡§®‡•ç‡§π‡•ã‡§Ç‡§®‡•á ‡§Ö‡§™‡§®‡•á ‡§∂‡•ç‡§∞‡•Ä‡§Æ‡•Å‡§ñ ‡§∏‡•á ‡§∏‡•Ä‡§§‡§æ‡§ú‡•Ä, ‡§≤‡§ï‡•ç‡§∑‡•ç‡§Æ‡§£‡§ú‡•Ä ‡§î‡§∞ ‡§∏‡§ñ‡§æ ‡§ó‡•Å‡§π ‡§ï‡•ã ‡§§‡•Ä‡§∞‡•ç‡§•‡§∞‡§æ‡§ú ‡§ï‡•Ä ‡§Æ‡§π‡§ø‡§Æ‡§æ ‡§ï‡§π‡§ï‡§∞ ‡§∏‡•Å‡§®‡§æ‡§Ø‡•Ä‡•§ ‡§§‡§¶‡§®‡§®‡•ç‡§§‡§∞ ‡§™‡•ç‡§∞‡§£‡§æ‡§Æ ‡§ï‡§∞‡§ï‡•á, ‡§µ‡§® ‡§î‡§∞ ‡§¨‡§ó‡•Ä‡§ö‡•ã‡§Ç ‡§ï‡•ã ‡§¶‡•á‡§ñ‡§§‡•á ‡§π‡•Å‡§è ‡§î‡§∞ ‡§¨‡§°‡§º‡•á ‡§™‡•ç‡§∞‡•á‡§Æ ‡§∏‡•á ‡§Æ‡§æ‡§π‡§æ‡§§‡•ç‡§Æ‡•ç‡§Ø ‡§ï‡§π‡§§‡•á ‡§π‡•Å‡§è--‡••'),
 Document(id='3e825e68-2d94-41d0-a4e0-3e6a7c3334b0', metadata={'source': 'datasets/tagged_meanings.txt'}, page_content='4225 ‡§π‡§®‡•Å‡§Æ‡§æ‡§®‡§ú‡•Ä ‡§ï‡•á ‡§™‡•ç‡§∞‡•á‡§Æ‡§Ø‡•Å‡§ï‡•ç‡§§ ‡§µ‡§ö‡§® ‡§∏‡•Å‡§®‡§ï‡§∞ ‡§∏‡•Ä‡§§‡§æ‡§ú‡•Ä ‡§ï‡•á ‡§Æ‡§® ‡§Æ‡•á‡§Ç ‡§µ‡§ø‡§∂‡•ç‡§µ‡§æ‡§∏ ‡§â‡§§‡•ç‡§™‡§®‡•ç‡§® ‡§π‡•ã ‡§ó‡§Ø‡§æ‡•§ ‡§â‡§®‡•ç‡§π‡•ã‡§Ç‡§®‡•á ‡§ú‡§æ‡§® ‡§≤‡§ø‡§Ø‡§æ  ‡§ï‡§ø ‡§Ø‡§π ‡§Æ‡§®, ‡§µ‡§ö‡§® ‡§î‡§∞ ‡§ï‡§∞‡•ç‡§

### 📚 **Fetching IDs of Similar Meanings**

After retrieving similar meanings from Chroma, each meaning's content begins with the corresponding ID (since we tagged meanings with IDs earlier). To fetch full padya details like Verse, Kand, and Verse Type, we can extract the ID from the retrieved document and query our original `padyas` DataFrame.

In [26]:
padyas[padyas['id'] == int(docs[0].page_content.split()[0].strip())]

Unnamed: 0,Kand,Verse,Meaning,Verse Type,Verse Count,Page Number,id,tagged_meaning
2482,‡§Ö‡§Ø‡•ã‡§ß‡•ç‡§Ø‡§æ‡§ï‡§æ‡§£‡•ç‡§°,‡§ï‡§π‡§ø ‡§∏‡§ø‡§Ø ‡§≤‡§ñ‡§®‡§π‡§ø ‡§∏‡§ñ‡§π‡§ø ‡§∏‡•Å‡§®‡§æ‡§à‡•§ ‡§∂‡•ç‡§∞‡•Ä‡§Æ‡•Å‡§ñ ‡§§‡•Ä‡§∞‡§•‡§∞‡§æ‡§ú ‡§¨‡§°‡§º‡§æ...,"‡§â‡§®‡•ç‡§π‡•ã‡§Ç‡§®‡•á ‡§Ö‡§™‡§®‡•á ‡§∂‡•ç‡§∞‡•Ä‡§Æ‡•Å‡§ñ ‡§∏‡•á ‡§∏‡•Ä‡§§‡§æ‡§ú‡•Ä, ‡§≤‡§ï‡•ç‡§∑‡•ç‡§Æ‡§£‡§ú‡•Ä ‡§î‡§∞ ...",‡§ö‡•å‡§™‡§æ‡§à,‡•®,‡•™‡•©‡•ß,2482,"2482 ‡§â‡§®‡•ç‡§π‡•ã‡§Ç‡§®‡•á ‡§Ö‡§™‡§®‡•á ‡§∂‡•ç‡§∞‡•Ä‡§Æ‡•Å‡§ñ ‡§∏‡•á ‡§∏‡•Ä‡§§‡§æ‡§ú‡•Ä, ‡§≤‡§ï‡•ç‡§∑‡•ç‡§Æ‡§£‡§ú..."


Here's what happens:
- `docs[0].page_content.split()[0].strip()` extracts the first token (the ID) from the document content.
- We convert it to an integer to match the `id` column in the `padyas` DataFrame.
- Finally, we filter the DataFrame to get the full padya details for the matching ID.

### 🔍 **Combining Querying Functionality into a Function**

To make our semantic search workflow cleaner and reusable, we wrap the querying and retrieval logic into a single function: `retrieve_semantic_recommendations`.

Here's what this function does:
- Takes a **query** string (what the user is searching for) and a **top_k** (how many results to return).
- Uses the Chroma database to perform a **similarity search** based on the query.
- Extracts the **IDs** from the matched documents.
- Looks up the corresponding padya details in the `padyas` DataFrame.
- Returns a DataFrame of matching padyas (in DataFrame order, not ranked by similarity).

Now, with just one line, you can retrieve semantically related padyas for any user query! 📚✨

In [27]:
def retrieve_semantic_recommendations(
    query: str,
    top_k: int = 10
) -> pd.DataFrame:
    """Return the rows of `padyas` whose meanings best match `query`."""
    recommendations = db_padyas.similarity_search(query=query, k=top_k)

    # Each stored document starts with the padya ID, so the first token is the ID
    padya_ids = [int(doc.page_content.strip('"').split()[0]) for doc in recommendations]

    # Note: isin() returns rows in DataFrame order, not in similarity order
    return padyas[padyas['id'].isin(padya_ids)]
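Filtering with `isin` returns rows in DataFrame order, so the similarity ranking is lost. If the ranking matters, the IDs can be collected in retrieval order first. The sketch below assumes a hypothetical helper, `extract_ids`, and uses `SimpleNamespace` objects with made-up content as stand-ins for the documents Chroma returns.

```python
from types import SimpleNamespace

def extract_ids(docs) -> list[int]:
    """Return the leading integer ID of each document, in retrieval order, without duplicates."""
    ids = []
    for doc in docs:
        padya_id = int(doc.page_content.strip('"').split()[0])
        if padya_id not in ids:
            ids.append(padya_id)
    return ids

# Fake search results standing in for Chroma's Document objects.
fake_docs = [
    SimpleNamespace(page_content='4225 later in the table, first by similarity'),
    SimpleNamespace(page_content='2482 earlier in the table, second by similarity'),
]
print(extract_ids(fake_docs))
```

With the real DataFrame, `padyas.set_index('id').loc[ids].reset_index()` would then return the rows in that order, assuming the `id` values are unique.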

In [28]:
retrieve_semantic_recommendations(
    'हनुमान जी की सीता सीताजी से भेन्ट',
    10
)

Unnamed: 0,Kand,Verse,Meaning,Verse Type,Verse Count,Page Number,id,tagged_meaning
1432,‡§¨‡§æ‡§≤‡§ï‡§æ‡§£‡•ç‡§°,‡§∏‡•ã‡§π‡§§‡§ø ‡§∏‡•Ä‡§Ø ‡§∞‡§æ‡§Æ ‡§ï‡•à ‡§ú‡•ã‡§∞‡•Ä‡•§ ‡§õ‡§¨‡§ø ‡§∏‡§ø‡§Ç‡§ó‡§æ‡§∞‡•Å ‡§Æ‡§®‡§π‡•Å‡§Å ‡§è‡§ï ‡§†‡•ã...,‡§∂‡•ç‡§∞‡•Ä‡§∏‡•Ä‡§§‡§æ-‡§∞‡§æ‡§Æ‡§ú‡•Ä ‡§ï‡•Ä ‡§ú‡•ã‡§°‡§º‡•Ä ‡§ê‡§∏‡•Ä ‡§∏‡•Å‡§∂‡•ã‡§≠‡§ø‡§§ ‡§π‡•ã ‡§∞‡§π‡•Ä ‡§π‡•à ...,‡§ö‡•å‡§™‡§æ‡§à,‡•™,‡•®‡•´‡•¨,1432,1432 ‡§∂‡•ç‡§∞‡•Ä‡§∏‡•Ä‡§§‡§æ-‡§∞‡§æ‡§Æ‡§ú‡•Ä ‡§ï‡•Ä ‡§ú‡•ã‡§°‡§º‡•Ä ‡§ê‡§∏‡•Ä ‡§∏‡•Å‡§∂‡•ã‡§≠‡§ø‡§§ ‡§π‡•ã ‡§∞‡§π...
2566,‡§Ö‡§Ø‡•ã‡§ß‡•ç‡§Ø‡§æ‡§ï‡§æ‡§£‡•ç‡§°,‡§Ü‡§ó‡•á‡§Ç ‡§∞‡§æ‡§Æ‡•Å ‡§≤‡§ñ‡§®‡•Å ‡§¨‡§®‡•á ‡§™‡§æ‡§õ‡•á‡§Ç‡•§ ‡§§‡§æ‡§™‡§∏ ‡§¨‡•á‡§∑ ‡§¨‡§ø‡§∞‡§æ‡§ú‡§§ ‡§ï‡§æ‡§õ‡•á...,"‡§Ü‡§ó‡•á ‡§∂‡•ç‡§∞‡•Ä‡§∞‡§æ‡§Æ‡§ú‡•Ä ‡§π‡•à‡§Ç, ‡§™‡•Ä‡§õ‡•á ‡§≤‡§ï‡•ç‡§∑‡•ç‡§Æ‡§£‡§ú‡•Ä ‡§∏‡•Å‡§∂‡•ã‡§≠‡§ø‡§§ ‡§π‡•à‡§Ç‡•§...",‡§ö‡•å‡§™‡§æ‡§à,‡•ß,‡•™‡•™‡•´,2566,"2566 ‡§Ü‡§ó‡•á ‡§∂‡•ç‡§∞‡•Ä‡§∞‡§æ‡§Æ‡§ú‡•Ä ‡§π‡•à‡§Ç, ‡§™‡•Ä‡§õ‡•á ‡§≤‡§ï‡•ç‡§∑‡•ç‡§Æ‡§£‡§ú‡•Ä ‡§∏‡•Å‡§∂‡•ã‡§≠‡§ø‡§§..."
3395,‡§Ö‡§Ø‡•ã‡§ß‡•ç‡§Ø‡§æ‡§ï‡§æ‡§£‡•ç‡§°,‡§ï‡§π‡§§‡§ø ‡§® ‡§∏‡•Ä‡§Ø ‡§∏‡§ï‡•Å‡§ö‡§ø ‡§Æ‡§® ‡§Æ‡§æ‡§π‡•Ä‡§Ç‡•§ ‡§á‡§π‡§æ‡§Å ‡§¨‡§∏‡§¨ ‡§∞‡§ú‡§®‡•Ä‡§Ç ‡§≠‡§≤ ‡§®...,"‡§∏‡•Ä‡§§‡§æ‡§ú‡•Ä ‡§ï‡•Å‡§õ ‡§ï‡§π‡§§‡•Ä ‡§®‡§π‡•Ä‡§Ç ‡§π‡•à‡§Ç, ‡§™‡§∞‡§®‡•ç‡§§‡•Å ‡§Æ‡§® ‡§Æ‡•á‡§Ç ‡§∏‡§ï‡•Å‡§ö‡§æ ...",‡§ö‡•å‡§™‡§æ‡§à,‡•™,‡•´‡•Æ‡•™,3395,"3395 ‡§∏‡•Ä‡§§‡§æ‡§ú‡•Ä ‡§ï‡•Å‡§õ ‡§ï‡§π‡§§‡•Ä ‡§®‡§π‡•Ä‡§Ç ‡§π‡•à‡§Ç, ‡§™‡§∞‡§®‡•ç‡§§‡•Å ‡§Æ‡§® ‡§Æ‡•á‡§Ç ‡§∏..."
3551,‡§Ö‡§Ø‡•ã‡§ß‡•ç‡§Ø‡§æ‡§ï‡§æ‡§£‡•ç‡§°,‡§≤‡§ñ‡§®‡§π‡§ø ‡§≠‡•á‡§Ç‡§ü‡§ø ‡§™‡•ç‡§∞‡§®‡§æ‡§Æ‡•Å ‡§ï‡§∞‡§ø ‡§∏‡§ø‡§∞ ‡§ß‡§∞‡§ø ‡§∏‡§ø‡§Ø ‡§™‡§¶ ‡§ß‡•Ç‡§∞‡§ø‡•§\n...,‡§´‡§ø‡§∞ ‡§≤‡§ï‡•ç‡§∑‡•ç‡§Æ‡§£‡§ú‡•Ä ‡§ï‡•ã ‡§ï‡•ç‡§∞‡§Æ‡§∂‡§É ‡§≠‡•á‡§Ç‡§ü‡§ï‡§∞ ‡§§‡§•‡§æ ‡§™‡•ç‡§∞‡§£‡§æ‡§Æ ‡§ï‡§∞ ‡§ï...,‡§¶‡•ã‡•¶,‡•©‡•ß‡•Æ,‡•¨‡•ß‡•¶,3551,3551 ‡§´‡§ø‡§∞ ‡§≤‡§ï‡•ç‡§∑‡•ç‡§Æ‡§£‡§ú‡•Ä ‡§ï‡•ã ‡§ï‡•ç‡§∞‡§Æ‡§∂‡§É ‡§≠‡•á‡§Ç‡§ü‡§ï‡§∞ ‡§§‡§•‡§æ ‡§™‡•ç‡§∞‡§£‡§æ‡§Æ...
4193,‡§∏‡•Å‡§®‡•ç‡§¶‡§∞‡§ï‡§æ‡§£‡•ç‡§°,‡§¶‡•á‡§ñ‡§ø ‡§Æ‡§®‡§π‡§ø ‡§Æ‡§π‡•Å‡§Å ‡§ï‡•Ä‡§®‡•ç‡§π ‡§™‡•ç‡§∞‡§®‡§æ‡§Æ‡§æ‡•§ ‡§¨‡•à‡§†‡•á‡§π‡§ø‡§Ç ‡§¨‡•Ä‡§§‡§ø ‡§ú‡§æ‡§§...,‡§∏‡•Ä‡§§‡§æ‡§ú‡•Ä ‡§ï‡•ã ‡§¶‡•á‡§ñ‡§ï‡§∞ ‡§π‡§®‡•Å‡§Æ‡§æ‡§®‡•Ç‡§ú‡•Ä‡§®‡•á ‡§â‡§®‡•ç‡§π‡•á‡§Ç ‡§Æ‡§®‡§π‡•Ä ‡§Æ‡•á‡§Ç ‡§™‡•ç...,‡§ö‡•å‡§™‡§æ‡§à,‡•™,‡•≠‡•®‡•®,4193,4193 ‡§∏‡•Ä‡§§‡§æ‡§ú‡•Ä ‡§ï‡•ã ‡§¶‡•á‡§ñ‡§ï‡§∞ ‡§π‡§®‡•Å‡§Æ‡§æ‡§®‡•Ç‡§ú‡•Ä‡§®‡•á ‡§â‡§®‡•ç‡§π‡•á‡§Ç ‡§Æ‡§®‡§π‡•Ä ‡§Æ...
4224,‡§∏‡•Å‡§®‡•ç‡§¶‡§∞‡§ï‡§æ‡§£‡•ç‡§°,‡§®‡§∞ ‡§¨‡§æ‡§®‡§∞‡§π‡§ø ‡§∏‡§Ç‡§ó ‡§ï‡§π‡•Å ‡§ï‡•à‡§∏‡•á‡§Ç‡•§ ‡§ï‡§π‡•Ä ‡§ï‡§•‡§æ ‡§≠‡§á ‡§∏‡§Ç‡§ó‡§§‡§ø ‡§ú‡•à‡§∏‡•á‡§Ç‡••,[‡§∏‡•Ä‡§§‡§æ‡§ú‡•Ä‡§®‡•á ‡§™‡•Ç‡§õ‡§æ--] ‡§®‡§∞ ‡§î‡§∞ ‡§µ‡§æ‡§®‡§∞ ‡§ï‡§æ ‡§∏‡§ú‡•ç‡§ú ‡§ï‡§π‡•ã ‡§ï‡•à ‡§∏‡•á...,‡§ö‡•å‡§™‡§æ‡§à,‡•¨,‡•≠‡•®‡•¨,4224,4224 [‡§∏‡•Ä‡§§‡§æ‡§ú‡•Ä‡§®‡•á ‡§™‡•Ç‡§õ‡§æ--] ‡§®‡§∞ ‡§î‡§∞ ‡§µ‡§æ‡§®‡§∞ ‡§ï‡§æ ‡§∏‡§ú‡•ç‡§ú ‡§ï‡§π‡•ã ...
4225,‡§∏‡•Å‡§®‡•ç‡§¶‡§∞‡§ï‡§æ‡§£‡•ç‡§°,‡§ï‡§™‡§ø ‡§ï‡•á ‡§¨‡§ö‡§® ‡§∏‡§™‡•ç‡§∞‡•á‡§Æ ‡§∏‡•Å‡§®‡§ø ‡§â‡§™‡§ú‡§æ ‡§Æ‡§® ‡§¨‡§ø‡§∏‡•ç‡§µ‡§æ‡§∏‡•§\n‡§ú‡§æ‡§®‡§æ ...,‡§π‡§®‡•Å‡§Æ‡§æ‡§®‡§ú‡•Ä ‡§ï‡•á ‡§™‡•ç‡§∞‡•á‡§Æ‡§Ø‡•Å‡§ï‡•ç‡§§ ‡§µ‡§ö‡§® ‡§∏‡•Å‡§®‡§ï‡§∞ ‡§∏‡•Ä‡§§‡§æ‡§ú‡•Ä ‡§ï‡•á ‡§Æ‡§® ...,‡§¶‡•ã‡•¶,‡•ß‡•©,‡•≠‡•®‡•≠,4225,4225 ‡§π‡§®‡•Å‡§Æ‡§æ‡§®‡§ú‡•Ä ‡§ï‡•á ‡§™‡•ç‡§∞‡•á‡§Æ‡§Ø‡•Å‡§ï‡•ç‡§§ ‡§µ‡§ö‡§® ‡§∏‡•Å‡§®‡§ï‡§∞ ‡§∏‡•Ä‡§§‡§æ‡§ú‡•Ä ‡§ï...
4242,‡§∏‡•Å‡§®‡•ç‡§¶‡§∞‡§ï‡§æ‡§£‡•ç‡§°,‡§∏‡•Ä‡§§‡§æ ‡§Æ‡§® ‡§≠‡§∞‡•ã‡§∏ ‡§§‡§¨ ‡§≠‡§Ø‡§ä‡•§ ‡§™‡•Å‡§®‡§ø ‡§≤‡§ò‡•Å ‡§∞‡•Ç‡§™ ‡§™‡§µ‡§®‡§∏‡•Å‡§§ ‡§≤‡§Ø‡§ä‡••,‡§§‡§¨ (‡§â ‡§∏‡•á ‡§¶‡•á‡§ñ‡§ï‡§∞) ‡§∏‡•Ä‡§§‡§æ‡§ú‡•Ä ‡§ï‡•á ‡§Æ‡§® ‡§Æ‡•á‡§Ç ‡§µ‡§ø‡§∂‡•ç‡§µ‡§æ‡§∏ ‡§π‡•Å‡§Ü‡•§ ...,‡§ö‡•å‡§™‡§æ‡§à,‡•´,‡•≠‡•®‡•Ø,4242,4242 ‡§§‡§¨ (‡§â ‡§∏‡•á ‡§¶‡•á‡§ñ‡§ï‡§∞) ‡§∏‡•Ä‡§§‡§æ‡§ú‡•Ä ‡§ï‡•á ‡§Æ‡§® ‡§Æ‡•á‡§Ç ‡§µ‡§ø‡§∂‡•ç‡§µ‡§æ‡§∏ ...
4244,‡§∏‡•Å‡§®‡•ç‡§¶‡§∞‡§ï‡§æ‡§£‡•ç‡§°,‡§Æ‡§® ‡§∏‡§Ç‡§§‡•ã‡§∑ ‡§∏‡•Å‡§®‡§§ ‡§ï‡§™‡§ø ‡§¨‡§æ‡§®‡•Ä‡•§ ‡§≠‡§ó‡§§‡§ø ‡§™‡•ç‡§∞‡§§‡§æ‡§™ ‡§§‡•á‡§ú ‡§¨‡§≤ ‡§∏‡§æ‡§®...,"‡§≠‡§ï‡•ç‡§§‡§ø, ‡§™‡•ç‡§∞‡§§‡§æ‡§™, ‡§§‡•á‡§ú ‡§î‡§∞ ‡§¨‡§≤ ‡§∏‡•á ‡§∏‡§®‡•Ä ‡§π‡•Å‡§à ‡§π‡§®‡•Å‡§Æ‡§æ‡§®‡•Ç‡§ú‡•Ä ...",‡§ö‡•å‡§™‡§æ‡§à,‡•ß,‡•≠‡•©‡•¶,4244,"4244 ‡§≠‡§ï‡•ç‡§§‡§ø, ‡§™‡•ç‡§∞‡§§‡§æ‡§™, ‡§§‡•á‡§ú ‡§î‡§∞ ‡§¨‡§≤ ‡§∏‡•á ‡§∏‡§®‡•Ä ‡§π‡•Å‡§à ‡§π‡§®‡•Å‡§Æ‡§æ..."
5176,‡§≤‡§Ç‡§ï‡§æ‡§ï‡§æ‡§£‡•ç‡§°,‡§¶‡•Ç‡§∞‡§ø‡§π‡§ø ‡§§‡•á ‡§™‡•ç‡§∞‡§®‡§æ‡§Æ ‡§ï‡§™‡§ø ‡§ï‡•Ä‡§®‡•ç‡§π‡§æ‡•§ ‡§∞‡§ò‡•Å‡§™‡§§‡§ø ‡§¶‡•Ç‡§§ ‡§ú‡§æ‡§®‡§ï‡•Ä ...,‡§π‡§®‡•Å‡§Æ‡§æ‡§®‡•Ç‡§ú‡•Ä‡§®‡•á [‡§∏‡•Ä‡§§‡§æ‡§ú‡•Ä‡§ï‡•ã] ‡§¶‡•Ç‡§∞ ‡§∏‡•á ‡§π‡•Ä ‡§™‡•ç‡§∞‡§£‡§æ‡§Æ ‡§ï‡§ø‡§Ø‡§æ‡•§ ...,‡§ö‡•å‡§™‡§æ‡§à,‡•©,‡•Æ‡•Æ‡•´,5176,5176 ‡§π‡§®‡•Å‡§Æ‡§æ‡§®‡•Ç‡§ú‡•Ä‡§®‡•á [‡§∏‡•Ä‡§§‡§æ‡§ú‡•Ä‡§ï‡•ã] ‡§¶‡•Ç‡§∞ ‡§∏‡•á ‡§π‡•Ä ‡§™‡•ç‡§∞‡§£‡§æ‡§Æ ‡§ï...


In [29]:
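import re

verse = padyas.loc[506, 'Verse']
# Split on danda (।) and double danda (॥); the capture group keeps the delimiters
verse_parts = re.split(r'(।|॥)', verse)
# Re-attach each delimiter to the text before it and join the pieces with newlines
verse_str = '\n'.join([verse_parts[i] + verse_parts[i + 1] for i in range(0, len(verse_parts) - 1, 2)])
verse_str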

'‡§ú‡•ã‡§ó‡•Ä ‡§Ö‡§ï‡§Ç‡§ü‡§ï ‡§≠‡§è ‡§™‡§§‡§ø ‡§ó‡§§‡§ø ‡§∏‡•Å‡§®‡§§ ‡§∞‡§§‡§ø ‡§Æ‡•Å‡§∞‡•Å‡§õ‡§ø‡§§ ‡§≠‡§à‡•§\n\n‡§∞‡•ã‡§¶‡§§‡§ø ‡§¨‡§¶‡§§‡§ø ‡§¨‡§π‡•Å ‡§≠‡§æ‡§Å‡§§‡§ø ‡§ï‡§∞‡•Å‡§®‡§æ ‡§ï‡§∞‡§§‡§ø ‡§∏‡§Ç‡§ï‡§∞ ‡§™‡§π‡§ø‡§Ç ‡§ó‡§à‡••\n\n‡§Ö‡§§‡§ø ‡§™‡•ç‡§∞‡•á‡§Æ ‡§ï‡§∞‡§ø ‡§¨‡§ø‡§®‡§§‡•Ä ‡§¨‡§ø‡§¨‡§ø‡§ß ‡§¨‡§ø‡§ß‡§ø ‡§ú‡•ã‡§∞‡§ø ‡§ï‡§∞ ‡§∏‡§®‡•ç‡§Æ‡•Å‡§ñ ‡§∞‡§π‡•Ä‡•§\n\n‡§™‡•ç‡§∞‡§≠‡•Å ‡§Ü‡§∏‡•Å‡§§‡•ã‡§∑ ‡§ï‡•É‡§™‡§æ‡§≤ ‡§∏‡§ø‡§µ ‡§Ö‡§¨‡§≤‡§æ ‡§®‡§ø‡§∞‡§ñ‡§ø ‡§¨‡•ã‡§≤‡•á ‡§∏‡§π‡•Ä‡••'