# 🧱 Retrieval-Augmented Generation Using FAISS and Gemini 2.0 Flash

This notebook builds a basic Retrieval-Augmented Generation (RAG) pipeline using:

* **Google Gemini** (gemini-2.0-flash) for answering queries
* **FAISS** as the vector database
* **HuggingFace Embeddings** for vectorizing text
* **LangChain** for text splitting and document management
* **PyPDF2** for PDF reading



## ✅ 1. Install Required Packages

In [1]:
!pip install -q langchain_community google-generativeai PyPDF2 langchain_huggingface faiss-cpu

## 🔐 2. Import Libraries & Configure Gemini API

In [16]:
import os
import google.generativeai as genai
from langchain_community.vectorstores import FAISS
from langchain.text_splitter import CharacterTextSplitter
from langchain.schema import Document
from PyPDF2 import PdfReader
from langchain_huggingface import HuggingFaceEmbeddings

import warnings
warnings.filterwarnings("ignore")

In [3]:
# API key: read the Gemini key from Colab secrets
import google.generativeai as genai
from google.colab import userdata

google_api = userdata.get("GOOGLE-API-KEY")
genai.configure(api_key=google_api)

gemini_model = genai.GenerativeModel('gemini-2.0-flash')
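
The cell above only works inside Google Colab, because `google.colab.userdata` is not available elsewhere. As a minimal sketch of one way to make the key lookup portable, the helper below tries the Colab secret first and then falls back to an environment variable. The function name `get_api_key` and the environment variable name `GOOGLE_API_KEY` are assumptions made for illustration; they are not part of the original notebook.

```python
import os

def get_api_key(secret_name="GOOGLE-API-KEY", env_name="GOOGLE_API_KEY"):
    """Return the Gemini key from Colab secrets, else from an environment variable."""
    try:
        # This import only succeeds inside Google Colab
        from google.colab import userdata
        key = userdata.get(secret_name)
        if key:
            return key
    except Exception:
        # Not running in Colab, or the secret is missing or not shared with the notebook
        pass
    # Fallback: hypothetical environment variable name, chosen for this sketch
    key = os.environ.get(env_name)
    if not key:
        raise RuntimeError(
            f"No API key found in Colab secret {secret_name!r} or env var {env_name!r}"
        )
    return key
```

With this helper, the configuration line would become `genai.configure(api_key=get_api_key())`.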

## 🧠 3. Load Embedding Model

In [4]:
# ✅ Cache the embedding model so repeated calls reuse one loaded instance
from functools import lru_cache

@lru_cache(maxsize=1)
def load_embedding_model():
    return HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")

embedding_model = load_embedding_model()

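
The embedding model turns each piece of text into a fixed-length list of numbers (384 of them for `all-MiniLM-L6-v2`), and texts with similar meaning should end up with vectors that point in similar directions. The sketch below illustrates that idea with cosine similarity on made-up 3-dimensional vectors. The vectors and their labels are invented for the example and do not come from the real model.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up toy "embeddings" (real ones have 384 dimensions)
query_vec = [0.9, 0.1, 0.0]    # stands in for a finance question
finance_vec = [0.8, 0.2, 0.1]  # stands in for a finance paragraph
cooking_vec = [0.0, 0.1, 0.9]  # stands in for an unrelated paragraph

# The related pair scores much higher than the unrelated pair
print(cosine_similarity(query_vec, finance_vec))
print(cosine_similarity(query_vec, cooking_vec))
```

To inspect a real vector, `embedding_model.embed_query("some text")` returns the embedding as a list of floats.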

## 📄 4. Read PDF Content

In [5]:
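# Read the PDF file and return the text of all pages as one string

def read_pdf(file_path):
    pdf_reader = PdfReader(file_path)
    text = ""
    for page in pdf_reader.pages:
        # Pages without extractable text (e.g. scanned images) contribute nothing;
        # a newline keeps words at page boundaries from running together
        text += (page.extract_text() or "") + "\n"
    return text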

In [26]:
# Read the uploaded PDF

text = read_pdf("/content/Investoreye.pdf")

## 🧩 5. Process Text into Chunks & Vectors

In [27]:
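if text.strip():
  # Wrap the raw text in a Document and split it into overlapping chunks
  document = Document(page_content=text)
  splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
  chunks = splitter.split_documents([document])
  texts = [chunk.page_content for chunk in chunks]
  # Embed every chunk and index the vectors in FAISS
  vector_db = FAISS.from_texts(texts, embedding_model)
  retriever = vector_db.as_retriever()
else:
  print("⚠️ No text could be extracted from the PDF. Please upload a readable document.")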
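
The settings `chunk_size=1000` and `chunk_overlap=200` mean that each chunk holds roughly 1000 characters and shares up to 200 of them with its neighbour, so a sentence cut at a boundary still appears whole in one of the two chunks. The sketch below shows that idea with a plain sliding window. It is a simplified stand-in, not LangChain's algorithm: `CharacterTextSplitter` first splits on a separator (by default a blank line) and then merges the pieces, so its boundaries differ and a piece with no separator in it can end up longer than `chunk_size`. The function name `sliding_chunks` is hypothetical.

```python
def sliding_chunks(text, chunk_size=1000, chunk_overlap=200):
    """Cut text into fixed-size windows that overlap by chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks

# 2500 characters of dummy text
sample = "".join(str(i % 10) for i in range(2500))
chunks = sliding_chunks(sample)
print([len(c) for c in chunks])  # → [1000, 1000, 900]
# The tail of one chunk equals the head of the next
print(chunks[0][-200:] == chunks[1][:200])  # → True
```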

## ❓ 6. Ask User for Input

In [28]:
user_query = input("Enter your question:")

Enter your question:Summarize this PDF


## 🔍 7. Retrieve & Generate Answer
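
This step embeds the question, looks up the closest chunks in the FAISS index, and pastes them into the prompt as context. The sketch below mimics that flow with a brute-force search over toy 2-dimensional vectors. The helper names, vectors, and chunk texts are invented for illustration, and squared Euclidean distance is used here as an illustrative choice; the actual search is handled by FAISS. The default of `k=4` mirrors the number of documents a LangChain retriever returns when no `k` is specified.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def top_k_chunks(query_vec, chunk_vecs, chunk_texts, k=4):
    """Return the k chunk texts whose vectors lie closest to the query vector."""
    ranked = sorted(
        range(len(chunk_vecs)),
        key=lambda i: squared_l2(query_vec, chunk_vecs[i]),
    )
    return [chunk_texts[i] for i in ranked[:k]]

# Made-up chunks and toy vectors
chunk_texts = ["chunk about revenue", "chunk about weather", "chunk about profit"]
chunk_vecs = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.2]]
query_vec = [1.0, 0.1]

best = top_k_chunks(query_vec, chunk_vecs, chunk_texts, k=2)
# Join the retrieved chunks the same way the notebook builds its context
context = "\n\n".join(best)
print(context)
```

Only the two chunks nearest to the query reach the prompt; the unrelated one is left out, which keeps the prompt short and focused.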

In [29]:
if user_query.strip():
  # Fetch the chunks most similar to the question
  relevant_docs = retriever.invoke(user_query)

  context = "\n\n".join([doc.page_content for doc in relevant_docs])

  prompt = f"""You are an expert assistant. Use the context below to answer the query. If unsure, say 'I don't know.'

  Context: {context}
  Query: {user_query}
  Answer:"""

  response = gemini_model.generate_content(prompt)
  print(response.text)
else:
  print("⚠️ No question was entered. Please type a question and run this cell again.")

This document is an investment report by Sharekhan on Bajaj Finance Ltd, Cholamandalam Investment and Finance Company Ltd and Federal Bank Ltd, dated April 30, 2025.

**Bajaj Finance Ltd:**
*   The report maintains a "Buy" recommendation with an unchanged price target of Rs. 10,500.
*   Net earnings were in line with estimates, AUM growth was strong, but management revised FY26 guidance slightly lower for return ratios and AUM growth, citing a focus on improving credit costs.
*   Key positives include strong AUM growth in specific loan segments and a falling cost-to-income ratio.
*   Key negatives include revised, slightly lower guidance for FY26 and a higher credit cost guidance.

**Cholamandalam Investment and Finance Company Ltd:**
*   The report maintains a "Buy" rating with a revised price target of Rs. 1,720.
*   Net earnings beat estimates due to lower opex and strong AUM growth, despite higher credit costs.
*   AUM growth is expected at 20-25% in FY26.
*   Key positives include