### ✅ **Problem Statement:**

Build a **Retrieval-Augmented Generation (RAG) pipeline** that answers questions about the **Indian Budget Speech 2025** by:

* Reading and preprocessing a PDF document,
* Storing its content in a **vector database (Astra DB)** using **OpenAI embeddings**,
* Creating a retriever for **semantic search**,
* Connecting it with a **Large Language Model (LLM)** for contextual question answering.

---

### ✅ **Working Code Summary:**

#### 1. **PDF Parsing**

* Load `indian_budget_speech_2025.pdf` and extract its raw text.
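
A minimal sketch of this step, assuming the `pypdf` library (the summary does not name the PDF reader that was actually used). `join_page_text` and `load_pdf_text` are hypothetical helper names.

```python
from typing import Iterable


def join_page_text(pages: Iterable) -> str:
    """Concatenate the text of every page, skipping pages with no extractable text."""
    parts = []
    for page in pages:
        text = page.extract_text()  # may be empty for image-only (scanned) pages
        if text:
            parts.append(text)
    return "\n".join(parts)


def load_pdf_text(path: str = "indian_budget_speech_2025.pdf") -> str:
    """Read a PDF from disk and return its raw text."""
    from pypdf import PdfReader  # assumption: pypdf is installed; imported lazily

    reader = PdfReader(path)
    return join_page_text(reader.pages)
```

Calling `raw_text = load_pdf_text()` would then give the string that the later steps split and embed.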

#### 2. **Environment Setup**

* Load API keys and database credentials from environment variables.
* Initialize connection to **Astra DB** via `cassio`.
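
One way this setup could look is sketched below. The environment variable names are assumptions (use whatever names your own configuration defines), and `read_settings` and `connect_astra` are hypothetical helpers wrapped around `cassio.init()`.

```python
import os
from typing import Mapping, Optional

# Assumed variable names; adjust to match your own environment.
REQUIRED_VARS = ("ASTRA_DB_APPLICATION_TOKEN", "ASTRA_DB_ID", "OPENAI_API_KEY")


def read_settings(env: Optional[Mapping[str, str]] = None) -> dict:
    """Return the required settings, failing early if any is missing or empty."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: env[name] for name in REQUIRED_VARS}


def connect_astra(settings: Mapping[str, str]) -> None:
    """Register the Astra DB connection that cassio-backed stores will reuse."""
    import cassio  # imported lazily; requires the cassio package

    cassio.init(
        token=settings["ASTRA_DB_APPLICATION_TOKEN"],
        database_id=settings["ASTRA_DB_ID"],
    )
```

If the keys live in a `.env` file, calling `load_dotenv()` from `python-dotenv` before `read_settings()` would populate `os.environ` first.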

#### 3. **Model and Embedding Initialization**

* Instantiate the OpenAI LLM and embedding model via `OpenAI()` and `OpenAIEmbeddings()`.

#### 4. **Vector Store Initialization**

* Create a `Cassandra` vector store object (`astra_vector_store`).
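
The model and vector store initialization might be sketched as follows. The import paths are assumptions that differ between LangChain releases, the table name is made up for illustration, and `vector_store_kwargs` and `build_components` are hypothetical helpers.

```python
def vector_store_kwargs(embedding, table_name: str = "budget_speech_2025") -> dict:
    """Arguments for the Cassandra vector store.

    session=None and keyspace=None let the store fall back to the
    connection registered earlier through cassio.
    """
    return {
        "embedding": embedding,
        "table_name": table_name,  # hypothetical table name
        "session": None,
        "keyspace": None,
    }


def build_components(table_name: str = "budget_speech_2025"):
    """Create the LLM, the embedding model, and the Astra-backed vector store."""
    # Lazy imports; these paths are assumptions and vary by LangChain version.
    from langchain_openai import OpenAI, OpenAIEmbeddings
    from langchain_community.vectorstores import Cassandra

    llm = OpenAI()  # reads OPENAI_API_KEY from the environment
    embedding = OpenAIEmbeddings()
    astra_vector_store = Cassandra(**vector_store_kwargs(embedding, table_name))
    return llm, embedding, astra_vector_store
```

Keeping the keyword arguments in a separate function makes the store configuration easy to inspect without opening a database connection.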

#### 5. **Token-Aware Text Splitting**

* Use `RecursiveCharacterTextSplitter` with `GPT2TokenizerFast` as its length function, so that chunk sizes are measured in tokens rather than characters.
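
A sketch of how the two pieces can fit together, assuming the tokenizer is used only to count tokens. The chunk size and overlap are illustrative values, not the ones used in the original code, and `make_token_counter` and `build_splitter` are hypothetical helper names.

```python
def make_token_counter(tokenizer):
    """Return a length function that measures text in tokens, not characters."""
    def count_tokens(text: str) -> int:
        return len(tokenizer.encode(text))
    return count_tokens


def build_splitter(chunk_size: int = 512, chunk_overlap: int = 24):
    """Create a splitter whose chunk_size and chunk_overlap are counted in GPT-2 tokens."""
    # Lazy imports; the splitter's import path is an assumption that varies by LangChain version.
    from transformers import GPT2TokenizerFast
    from langchain.text_splitter import RecursiveCharacterTextSplitter

    tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
    return RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,        # illustrative value
        chunk_overlap=chunk_overlap,  # illustrative value
        length_function=make_token_counter(tokenizer),
    )
```

With this in place, `chunks = build_splitter().split_text(raw_text)` would return a list of strings, each within the token budget as counted by the GPT-2 tokenizer.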

#### 6. **Insert Chunks into Vector Store**

* Add the text chunks to `astra_vector_store` using `add_texts()`, which embeds each chunk and stores it.
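
For larger documents it can help to insert in batches. The helper below is a hypothetical sketch that relies only on `add_texts()` accepting a list of strings and returning the ids of the stored entries; the batch size is an arbitrary choice.

```python
from typing import List, Sequence


def insert_chunks(store, chunks: Sequence[str], batch_size: int = 50) -> List[str]:
    """Add chunks to the vector store in batches and return the ids it reports."""
    ids: List[str] = []
    for start in range(0, len(chunks), batch_size):
        batch = list(chunks[start:start + batch_size])
        ids.extend(store.add_texts(batch))  # the store embeds and saves each chunk
    return ids
```

A call such as `insert_chunks(astra_vector_store, chunks)` would then load every chunk produced by the splitter.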

#### 7. **Create Vector Index Wrapper**

* Wrap the vector store in `VectorStoreIndexWrapper`, which provides a convenient interface for running semantic searches against it.
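
To illustrate what semantic search means here: chunks are ranked by how similar their embedding vectors are to the query's embedding. The toy sketch below uses made-up two-dimensional vectors in place of OpenAI embeddings and plain cosine similarity; in the real pipeline the search runs inside Astra DB, and `top_k` is a hypothetical name.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec: Sequence[float],
          indexed: List[Tuple[str, Sequence[float]]],
          k: int = 2) -> List[str]:
    """Return the k chunk texts whose vectors are most similar to the query vector."""
    ranked = sorted(
        indexed,
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]
```

The wrapper itself is typically created with `VectorStoreIndexWrapper(vectorstore=astra_vector_store)`, imported from `langchain.indexes.vectorstore` (the import path may vary by LangChain version).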

#### 8. **Set Up Retriever and RAG Chain**

* Create a retriever from the vector index.
* Use `RetrievalQA` to link the retriever with the LLM, enabling question answering.
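
The wiring might look like the sketch below. `build_qa_chain` is a hypothetical helper, `k=4` and the `"stuff"` chain type are assumed settings, and `stuff_prompt` is a hand-rolled illustration of the idea behind that chain type (all retrieved chunks placed in a single prompt), not LangChain's actual prompt template.

```python
from typing import Sequence


def stuff_prompt(question: str, chunks: Sequence[str]) -> str:
    """Illustrate the "stuff" strategy: put every retrieved chunk into one prompt."""
    context = "\n\n".join(chunks)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


def build_qa_chain(llm, vector_index, k: int = 4):
    """Link the retriever to the LLM with RetrievalQA."""
    from langchain.chains import RetrievalQA  # lazy import; path may vary by LangChain version

    # Retrieve the k most similar chunks for each question (k is an assumed value).
    retriever = vector_index.vectorstore.as_retriever(search_kwargs={"k": k})
    return RetrievalQA.from_chain_type(llm=llm, chain_type="stuff", retriever=retriever)
```

Because the retrieved chunks all go into one prompt, `k` and the chunk size together determine how much of the LLM's context window each question consumes.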

#### 9. **Query Examples**

* Ask queries like:

  * *"What are the tax benefits in the 2025 budget speech?"*
  * *"Which sectors has the government mentioned in the 2025 budget?"*

  For each query, the retriever fetches the most relevant chunks and the LLM generates an answer grounded in them.
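
A small sketch for running several questions in one go. `run_queries` is a hypothetical helper that accepts any callable mapping a question to an answer string, so it does not depend on a particular LangChain version.

```python
from typing import Callable, Dict, Iterable

QUESTIONS = [
    "What are the tax benefits in the 2025 budget speech?",
    "Which sectors has the government mentioned in the 2025 budget?",
]


def run_queries(ask: Callable[[str], str], questions: Iterable[str]) -> Dict[str, str]:
    """Ask every question and collect the stripped answers, keyed by question."""
    return {q: ask(q).strip() for q in questions}
```

With a `RetrievalQA` chain named `qa_chain`, the callable could be `lambda q: qa_chain.invoke({"query": q})["result"]`, assuming a LangChain release that supports `invoke()`.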
