# Gemini RAG Knowledge Engine
### A Full-Stack Retrieval-Augmented Generation (RAG) Application

**Author:** Karthik K
**Tech Stack:** Google Gemini 2.5 Flash, LangChain, ChromaDB

**Project Description:**
This notebook builds an end-to-end RAG pipeline. It ingests custom PDF/TXT documents, splits them into chunks, embeds the chunks into a vector database, and uses the Gemini 2.5 Flash model to answer user queries grounded in that data. The final output is a deployed Streamlit web application.

## **Environment Setup**
Install the libraries the RAG pipeline needs:
* `langchain`: Orchestration framework.
* `chromadb`: Vector database for storing document embeddings.
* `sentence-transformers`: Library that provides the open-source embedding model.
* `google-generativeai`: Google's SDK for the Gemini models.
* `langchain-google-genai`: LangChain integration for Gemini.

In [1]:
!pip install chromadb sentence-transformers



In [2]:
!pip install -U langchain-google-genai google-generativeai


In [3]:
!pip install google-generativeai



In [4]:
!pip install google-genai



In [5]:
!pip install -U langchain-google-genai


# LangChain Setup

In [1]:
!pip install -U langchain



In [2]:
!pip install -U langchain langchain-google-genai



In [3]:
!pip install langchain_community



In [4]:
!pip install pypdf



# Necessary Imports

In [5]:
# Chains
from langchain_classic.chains import RetrievalQA
from langchain_classic.chains import ConversationalRetrievalChain
from langchain_classic.memory.buffer import ConversationBufferMemory

In [6]:
from langchain_community.document_loaders import DirectoryLoader, TextLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma


import os
from google.colab import userdata


from langchain_google_genai import ChatGoogleGenerativeAI

## **The Main Application Logic**
The cells below contain the core logic for the application. They handle:
1.  **Authentication:** Loading API keys securely.
2.  **Ingestion:** Loading text/PDF documents from the data directory.
3.  **Indexing:** Splitting text into chunks and creating vector embeddings.
4.  **Retrieval Chain:** Connecting the Gemini LLM to the Vector Store.
5.  **Testing:** Running a sample query to verify the pipeline works.

In [7]:
from google.colab import userdata
from langchain_google_genai import ChatGoogleGenerativeAI

GOOGLE_API_KEY = userdata.get('GEMINI_API_KEY')

llm = ChatGoogleGenerativeAI(
    model="gemini-2.5-flash",
    temperature=0,
    max_tokens=None,
    timeout=None,
    max_retries=2,
    api_key=GOOGLE_API_KEY
)
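
The cell above reads the key from Colab's secret store, which only exists inside a Colab runtime. A minimal sketch of a fallback for running the same code elsewhere is shown below. The helper name `load_api_key` and the environment-variable fallback are assumptions for illustration and are not part of the original notebook.

```python
import os


def load_api_key(name="GEMINI_API_KEY", env=None):
    """Return the secret `name` from Colab userdata if available, else from the environment.

    `load_api_key` is a hypothetical helper, not part of the notebook or of any library.
    """
    env = os.environ if env is None else env
    try:
        # google.colab can only be imported inside a Colab runtime
        from google.colab import userdata
        value = userdata.get(name)
        if value:
            return value
    except Exception:
        # Not running in Colab, or the secret is not available to this notebook
        pass
    value = env.get(name)
    if not value:
        raise RuntimeError(f"Secret {name!r} not found in Colab userdata or environment")
    return value
```

With this in place, `api_key=load_api_key()` could replace the direct `userdata.get(...)` call, and a missing key fails early with a clear error instead of surfacing later as an authentication failure.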

In [8]:
messages = [
    (
        "system",
        "You are a helpful assistant that translates English to French. Translate the user sentence.",
    ),
    ("human", "I love programming."),
]

ai_msg = llm.invoke(messages)
ai_msg

AIMessage(content="J'adore la programmation.", additional_kwargs={}, response_metadata={'prompt_feedback': {'block_reason': 0, 'safety_ratings': []}, 'finish_reason': 'STOP', 'model_name': 'gemini-2.5-flash', 'safety_ratings': [], 'model_provider': 'google_genai'}, id='lc_run--414444ef-bcbb-45eb-b517-e61cc2090e5b-0', usage_metadata={'input_tokens': 21, 'output_tokens': 7, 'total_tokens': 28, 'input_token_details': {'cache_read': 0}})

In [9]:
from google.colab import drive
import os

# Mount Google Drive
drive.mount('/content/drive')

# Move to the project folder
%cd /content/drive/My Drive/RAG-Chatbot-Project/

# Verify the working directory and its contents
print("Current folder:", os.getcwd())
print("Files in here:", os.listdir())

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
/content/drive/My Drive/RAG-Chatbot-Project
Current folder: /content/drive/My Drive/RAG-Chatbot-Project
Files in here: ['.git', 'README.md', 'data', 'chroma_db', '.ipynb_checkpoints', 'Gemini-RAG-Knowledge-Engine.ipynb', '1706.03762v7.pdf']


In [10]:
DATA_PATH = './data'

# Load documents
loader = DirectoryLoader(DATA_PATH, glob="*.txt", loader_cls=TextLoader)
documents = loader.load()

print(f"Loaded {len(documents)} document(s).")

Loaded 2 document(s).


In [11]:
# Split documents into chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = text_splitter.split_documents(documents)

print(f"Split into {len(chunks)} chunks.")

Split into 2 chunks.
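
To make the `chunk_size` and `chunk_overlap` parameters concrete, here is a simplified sketch of overlapping chunking. It uses a fixed sliding window over characters, which is an assumption made for clarity: `RecursiveCharacterTextSplitter` itself prefers to break on separators such as paragraphs and newlines, so its real boundaries will differ. The function name `split_with_overlap` is hypothetical.

```python
def split_with_overlap(text, chunk_size=500, chunk_overlap=50):
    """Split `text` into windows of `chunk_size` characters that share `chunk_overlap` characters.

    Simplified illustration only; not how RecursiveCharacterTextSplitter picks boundaries.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


# A text shorter than chunk_size stays in one piece
short_chunks = split_with_overlap("x" * 400)
# A 1000-character text yields three windows starting at 0, 450 and 900
long_chunks = split_with_overlap("x" * 1000)
```

Under this simplified model, a document shorter than `chunk_size` produces exactly one chunk, which would be consistent with two small text files yielding two chunks above.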


In [12]:
# Initialize the embedding model
model_name = "sentence-transformers/all-MiniLM-L6-v2"
embedding_model = HuggingFaceEmbeddings(model_name=model_name)

persist_directory = './chroma_db'

# Create the vector database
vectorstore = Chroma.from_documents(
    chunks,
    embedding_model,
    persist_directory=persist_directory
)

print("Success: Vector store created.")

Success: Vector store created.
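
Once the chunks are embedded, retrieval comes down to ranking stored vectors by their similarity to the query vector. The sketch below shows that idea with cosine similarity on plain Python lists. Cosine is an assumption chosen for illustration: the metric Chroma actually uses depends on how the collection is configured, and the toy 2-dimensional vectors stand in for the much longer vectors the embedding model produces. The names `cosine_similarity` and `top_k` are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=5):
    """Return the indices of the k stored vectors most similar to the query."""
    scored = [(cosine_similarity(query_vec, v), i) for i, v in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [i for _, i in scored[:k]]


# Toy vectors: the query points mostly along the first axis
doc_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
best = top_k([1.0, 0.1], doc_vecs, k=2)
```

A real vector store avoids this linear scan by using an index, but the ranking idea is the same one the retriever relies on when it returns the `k` closest chunks.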


In [None]:
import os
from google.colab import files
from langchain_community.document_loaders import PyPDFLoader, TextLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_classic.chains import ConversationalRetrievalChain
from langchain_classic.memory import ConversationBufferMemory

# 1. Upload a file
print("Please upload a PDF or Text file:")
uploaded = files.upload()

# 2. Process the file
if uploaded:
    for filename in uploaded.keys():
        print(f"\nProcessing {filename}...")

        # Save file temporarily
        file_path = f"./{filename}"
        with open(file_path, "wb") as f:
            f.write(uploaded[filename])

        # Select loader (case-insensitive, so "paper.PDF" is also treated as a PDF)
        if filename.lower().endswith(".pdf"):
            loader = PyPDFLoader(file_path)
        else:
            loader = TextLoader(file_path)

        new_docs = loader.load()
        print(f"Loaded {len(new_docs)} pages/documents.")

        # Split text
        text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
        new_chunks = text_splitter.split_documents(new_docs)
        print(f"Split into {len(new_chunks)} chunks.")

        # 3. Add to Database
        vectorstore.add_documents(new_chunks)
        print(f"✅ Successfully added {filename} to the database!")

    # ---------------------------------------------------------
    # 4. REFRESH THE BRAIN (Happens AFTER upload)
    # ---------------------------------------------------------
    print("🔄 Refreshing Chatbot Brain...")

    # Define Memory
    memory = ConversationBufferMemory(
        memory_key="chat_history",
        return_messages=True,
        output_key='answer'
    )

    # Build the Conversational Chain
    qa_chain = ConversationalRetrievalChain.from_llm(
        llm=llm,
        retriever=vectorstore.as_retriever(search_kwargs={"k": 5}),
        memory=memory,
        return_source_documents=True,
        verbose=False
    )

    print("🚀 Chatbot is updated and ready for questions!")
else:
    print("No file uploaded.")

Please upload a PDF or Text file:


In [None]:
# Question 1: Initial Context
q1 = "What is the Transformer?"
print(f"👤 User: {q1}")
result1 = qa_chain.invoke({"question": q1})
print(f"🤖 Bot: {result1['answer']}\n")

# Question 2: Follow-up (Using "It")
# The bot must know that "It" refers to the Transformer from Q1
q2 = "Does it use recurrent layers?"
print(f"👤 User: {q2}")
result2 = qa_chain.invoke({"question": q2})
print(f"🤖 Bot: {result2['answer']}")

# --- Cite Sources (The Professional Touch) ---
print("\n--- 📄 Citations ---")
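for doc in result2['source_documents']:
    # Get source name and page number if available
    source_name = doc.metadata.get('source', 'Unknown file')
    page = doc.metadata.get('page')
    # PyPDFLoader stores a zero-based page index; text files have no page entry
    page_label = f"Page {page + 1}" if isinstance(page, int) else "Unknown page"
    print(f"- Found in: {source_name} ({page_label})")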
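
The follow-up question only works because the chain sees the earlier exchange. A conversational retrieval chain typically asks the LLM to rewrite the follow-up into a standalone question before searching the vector store. The sketch below builds such a rewrite prompt by hand. The prompt wording and the helper names `format_history` and `build_condense_prompt` are assumptions for illustration, not LangChain's actual template.

```python
def format_history(chat_history):
    """Render (question, answer) pairs as alternating Human/Assistant lines."""
    lines = []
    for question, answer in chat_history:
        lines.append(f"Human: {question}")
        lines.append(f"Assistant: {answer}")
    return "\n".join(lines)


def build_condense_prompt(chat_history, follow_up):
    """Build a prompt asking an LLM to turn a follow-up into a standalone question.

    Hypothetical wording; the real chain uses its own prompt template.
    """
    return (
        "Given the conversation below, rephrase the follow-up question "
        "as a standalone question.\n\n"
        f"Chat history:\n{format_history(chat_history)}\n\n"
        f"Follow-up question: {follow_up}\n"
        "Standalone question:"
    )


# Placeholder answer text; in the notebook this would be result1['answer']
history = [("What is the Transformer?", "<answer to question 1>")]
prompt = build_condense_prompt(history, "Does it use recurrent layers?")
```

The rewritten question, not the raw "Does it use recurrent layers?", is what would then be embedded and matched against the stored chunks, which is why the pronoun "it" can still retrieve passages about the Transformer.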

In [None]:
import json
import os
from google.colab import userdata

notebook_filename = "Gemini-RAG-Knowledge-Engine.ipynb"
email = "karthikk1162@gmail.com"
name = "Karthik K"
repo_url = "github.com/karthik-k11/RAG-Chatbot-Project.git"

with open(notebook_filename, 'r', encoding='utf-8') as f:
    data = json.load(f)

if 'widgets' in data.get('metadata', {}):
    del data['metadata']['widgets']

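# Drop the final cell only if it is this push cell, so re-running never deletes real content
if data['cells'] and 'git push' in ''.join(data['cells'][-1].get('source', '')):
    data['cells'] = data['cells'][:-1]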

with open(notebook_filename, 'w', encoding='utf-8') as f:
    json.dump(data, f, indent=2)

print(f"👻 Created clean version of {notebook_filename} (Metadata & Push code removed).")

token = userdata.get('GITHUB_TOKEN')

!git config --global user.email "{email}"
!git config --global user.name "{name}"

!git add "{notebook_filename}"
!git commit -m "Testing done again using a large file"
!git push https://{token}@{repo_url}

print("Push complete! The notebook on GitHub does NOT contain this cell.")