# **Instructions to Run the RAG Project Notebook**

You can run this notebook either in Google Colab (easiest option) or on your local machine (VS Code, Jupyter, etc.). Follow the steps below according to your setup.

   1. Option 1: Run in Google Colab (Recommended for Beginners)
      
      * Open the notebook in Google Colab by uploading it or using the GitHub integration.

      * Install the required dependencies by running the installation cells in the notebook.

      * Set your Gemini API Key:

          * Go to the left menu in Colab → Secrets (key icon)

          * Add a new variable named GOOGLE_API_KEY and paste your key from Google AI Studio.

          * Alternatively, you can set the key directly in a notebook cell:

                import os
                os.environ["GOOGLE_API_KEY"] = "your_api_key_here"


      * Run all the cells in sequence to execute the RAG pipeline.

   2. Option 2: Run Locally (VS Code, Jupyter, or Any Notebook Environment)

      * Clone the repository.
      * Set up a virtual environment.
      * Get and set up your API key:
        * Create a .env file in the project folder and add:
              GOOGLE_API_KEY=your_api_key_here
      * Run all the cells of the notebook.
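
When running locally, note that the setup cell reads the key through `google.colab.userdata`, which is only available inside Colab. One way to pick up the key from the `.env` file instead is sketched below using only the standard library. The helper name `load_env_file` is hypothetical (it is not part of this project), and a package such as `python-dotenv` is a common alternative.

```python
import os
from pathlib import Path


def load_env_file(path=".env"):
    """Read KEY=value lines from a .env file into os.environ (hypothetical helper)."""
    env_path = Path(path)
    if not env_path.exists():
        return {}
    loaded = {}
    for raw_line in env_path.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments, and lines without an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        key = key.strip()
        value = value.strip().strip('"').strip("'")
        # Do not overwrite a variable that is already set in the environment.
        os.environ.setdefault(key, value)
        loaded[key] = value
    return loaded


load_env_file()  # looks for .env in the current working directory
GOOGLE_API_KEY = os.environ.get("GOOGLE_API_KEY")
print("Key found:", GOOGLE_API_KEY is not None)
```

With this in place, the local notebook can use `GOOGLE_API_KEY` from the environment rather than calling `userdata.get(...)`.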

# **1. RAG**

Retrieval-Augmented Generation (RAG) is an AI pipeline that combines information retrieval with a generative model to produce more accurate, context-aware responses. It is commonly used to make an LLM answer domain-specific questions.

LangChain framework **docs**: https://python.langchain.com/docs/introduction/


# 1.1 Setup

In [None]:
!pip install -qU langchain langchain-chroma langchain-community pypdf langchain-openai langchain-google-genai sentence-transformers


In [None]:
import os
from google.colab import userdata
import google.generativeai as genai

# OpenAI does not provide free tokens, so the OpenAI key stays commented out
# os.environ["OPENAI_API_KEY"] = userdata.get('OPENAI_API_KEY')

# Get your Google API key from Colab secrets
GOOGLE_API_KEY = userdata.get('GOOGLE_API_KEY')
genai.configure(api_key=GOOGLE_API_KEY)

# 1.2 Load the Document

First step:

Get the data source. It can be a PDF, an Excel file, or another format; there are plenty of sources to experiment with. Here we read a PDF file with `PyPDFLoader`, which relies on the `pypdf` package installed in the setup step.

In [None]:
from langchain_community.document_loaders import PyPDFLoader
loader = PyPDFLoader("/content/Untitled document.pdf")  # replace with your file path
documents = loader.load()

# 1.3 Chunking

We need to chunk the data. Why?
Because the whole document may not fit in the model's context window, we split the file into smaller chunks and pass only those chunks to the AI. For PDF text, `RecursiveCharacterTextSplitter` is a good default.

Tokenizer process visualization: https://tiktokenizer.vercel.app/

Chunking process visualization: https://chunkviz.up.railway.app/



In [None]:
from langchain_text_splitters import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
texts = text_splitter.split_documents(documents)
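
To make `chunk_size` and `chunk_overlap` concrete, here is a simplified fixed-size character splitter. It is only an illustration and not how `RecursiveCharacterTextSplitter` works internally: that splitter prefers to break on separators such as paragraph and line breaks, while this sketch cuts at fixed character positions. The function name `split_with_overlap` is made up for this example.

```python
def split_with_overlap(text, chunk_size=500, chunk_overlap=50):
    """Cut text into fixed-size character windows that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = split_with_overlap(sample, chunk_size=10, chunk_overlap=3)
print(chunks)
# ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

Each chunk starts with the last three characters of the previous one. The overlap keeps a sentence that straddles a boundary at least partly visible in both chunks.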

# 1.4 Embeddings

Now we need to create embeddings and store them in a vector database. There are many models that generate embeddings, and comparing them is a topic of its own that we will explore as we go. For this notebook there are two steps:


1.   Generate Embeddings
2.   Store in the vector database

Embedding models leaderboard (Hugging Face): https://huggingface.co/spaces/mteb/leaderboard
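
As a rough intuition for why embeddings are useful, the sketch below compares vectors with cosine similarity. The `toy_embed` function is a deliberately crude stand-in (it just counts words from a small, made-up vocabulary); real embedding models produce dense vectors that capture meaning, but they are compared in a similar way.

```python
import math
from collections import Counter


def toy_embed(text, vocabulary):
    """Count how often each vocabulary word appears (a stand-in for a real embedding model)."""
    counts = Counter(text.lower().split())
    return [float(counts[word]) for word in vocabulary]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


vocabulary = ["communication", "message", "python", "code"]
doc_a = toy_embed("communication clarifies the message", vocabulary)
doc_b = toy_embed("python code runs python code", vocabulary)
query = toy_embed("what is communication", vocabulary)

print(cosine_similarity(query, doc_a))  # higher: shares the word "communication"
print(cosine_similarity(query, doc_b))  # 0.0: no shared vocabulary words
```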

In [None]:
!pip install -qU langchain-google-genai

In [None]:
from langchain_google_genai import GoogleGenerativeAIEmbeddings
embeddings = GoogleGenerativeAIEmbeddings(model="models/embedding-001", google_api_key=GOOGLE_API_KEY)
print(embeddings)

# OpenAI free tokens are not available, so we use Google or Hugging Face embeddings instead
# from langchain_openai import OpenAIEmbeddings
# embeddings = OpenAIEmbeddings(model="text-embedding-ada-002")
# print(embeddings)

client=<google.ai.generativelanguage_v1beta.services.generative_service.client.GenerativeServiceClient object at 0x78c6935a7770> async_client=<google.ai.generativelanguage_v1beta.services.generative_service.async_client.GenerativeServiceAsyncClient object at 0x78c692229970> model='models/embedding-001' task_type=None google_api_key=SecretStr('**********') credentials=None client_options=None transport=None request_options=None


## 1.4.1 Use of Hugging Face Embedding Model

In [None]:
!pip install -qU sentence-transformers

In [None]:
from langchain_community.embeddings import HuggingFaceEmbeddings

# You can choose a different model from the Hugging Face model hub if needed
# https://huggingface.co/models?library=sentence-transformers

hf_embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



# 1.5 Vector DB

We will use Chroma to store our embeddings. Embeddings are the backbone of RAG. We pass the embedding function and the chunked text to the vector database, which embeds each chunk and stores it.
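
Conceptually, a vector store keeps one vector per chunk and returns the chunks whose vectors are closest to the query vector. The class below is a minimal in-memory sketch of that idea with hand-written 2D vectors. `TinyVectorStore` is a made-up name for illustration and says nothing about how Chroma is implemented; a real vector database also adds persistence and indexing.

```python
import math


class TinyVectorStore:
    """Illustrative in-memory store: keeps (vector, text) pairs and ranks by cosine similarity."""

    def __init__(self):
        self._vectors = []
        self._texts = []

    @staticmethod
    def _normalize(vector):
        length = math.sqrt(sum(x * x for x in vector))
        if length == 0:
            return list(vector)
        return [x / length for x in vector]

    def add(self, vector, text):
        # Store unit-length vectors so a dot product equals cosine similarity.
        self._vectors.append(self._normalize(vector))
        self._texts.append(text)

    def search(self, query_vector, k=2):
        query = self._normalize(query_vector)
        scored = [
            (sum(q * v for q, v in zip(query, vector)), text)
            for vector, text in zip(self._vectors, self._texts)
        ]
        # Highest similarity first; keep the top k results.
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:k]


store = TinyVectorStore()
store.add([1.0, 0.0], "chunk about communication")
store.add([0.0, 1.0], "chunk about something else")
store.add([0.9, 0.1], "chunk about the purpose of communication")

for score, text in store.search([1.0, 0.0], k=2):
    print(round(score, 3), text)
```

The two communication chunks come back first because their vectors point in nearly the same direction as the query vector.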

In [None]:
from langchain_chroma import Chroma as ch

# Define persistent directory
persist_dir = "chroma_db"

# Create a Chroma vector store from the documents (Gemini embeddings version):

# db = ch.from_documents(
#     documents = texts,
#     embedding = embeddings,
#     persist_directory=persist_dir
#   )

# The vector DB instance is created with:
# - the chunked PDF data (texts), which acts as the knowledge base for the AI
# - the embedding function (Google Gemini embeddings in the version above)
# - the storage location (chroma_db)

# The Google Gemini embedding model hit a "resource exhausted" error,
# so we use the Hugging Face embedding model instead.
# Create a Chroma vector store from the documents
db = ch.from_documents(
    documents = texts,
    embedding = hf_embeddings, # Use Hugging Face embeddings here
    persist_directory=persist_dir
)

print("ChromaDB created successfully with Hugging Face embeddings!")


ChromaDB created successfully with Hugging Face embeddings!


To view the stored data, fetch it from the vector store and inspect it.

In [None]:
# Fetch all the data stored in the Chroma vectorstore
stored_data = db.get()

# inspect the data
print("Stored IDs: ", stored_data['ids'])              # IDs of the stored chunks
print("Stored Documents: ", stored_data['documents'])  # Original text data
print("Stored Metadata: ", stored_data['metadatas'])

Stored IDs:  ['5d34a9a7-f662-4ffe-b245-c037821fb4fc', '14a9d6b1-cb2d-4b0e-80fb-9cc03c772cdd']
Stored Documents:  ['1.  Introduction  to  Communication:  Communication  serves  as  the  backbone  of  effective  collaboration  and  success  in  any  \nenvironment.\n It  encompasses  the  exchange  of  thoughts,  ideas,  feelings,  and  messages  among  individuals.  Mastering  communication  skills  is  crucial  for  building  relationships,  problem-solving,  and  \nachieving', "achieving\n common  objectives.   2.  Purpose  of  Communication:  Every  communication  interaction  serves  a  specific  purpose,  whether  it's  sharing  updates,  conveying  information,  making  decisions,  persuading  others,  or  seeking  assistance.  \nUnderstanding\n your  objective  clarifies  your  message  and  enhances  its  effectiveness."]
Stored Metadata:  [{'producer': 'Skia/PDF m142 Google Docs Renderer', 'page': 0, 'total_pages': 1, 'page_label': '1', 'creator': 'PyPDF', 'title': 'Untitled doc
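
The raw printout is hard to read because PDF extraction leaves extra spaces and newlines in each chunk. A small helper like the one below can summarize the stored chunks one per line. It assumes only the `ids`, `documents`, and `metadatas` keys seen in the output above; the name `preview_chunks` and the sample dictionary are made up for illustration.

```python
def preview_chunks(stored, width=60):
    """Return one summary line per stored chunk: short ID, page number, text preview."""
    lines = []
    for chunk_id, text, meta in zip(stored["ids"], stored["documents"], stored["metadatas"]):
        # Collapse the runs of whitespace and newlines left over from PDF extraction.
        clean = " ".join(text.split())
        preview = clean if len(clean) <= width else clean[:width] + "..."
        lines.append(f"{chunk_id[:8]} | page {meta.get('page')} | {preview}")
    return lines


# Hand-made stand-in with the same keys as the stored data.
example = {
    "ids": ["aaaa1111-example", "bbbb2222-example"],
    "documents": [
        "1.  Introduction  to  Communication:\n a first chunk",
        "2.  Purpose  of  Communication",
    ],
    "metadatas": [{"page": 0}, {"page": 0}],
}
for line in preview_chunks(example):
    print(line)
```

In the notebook, the same helper could be called as `preview_chunks(stored_data)`.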

# 1.6 Retriever

Now build the retriever. It fetches the chunks in your vector store that are most relevant to a query.

In [None]:
from langchain import hub
from langchain_google_genai import ChatGoogleGenerativeAI
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

# Turn the vector store into a retriever
retriever = db.as_retriever()

# Ready-made RAG prompt from the LangChain Hub; you can use your own prompt as well
prompt = hub.pull("rlm/rag-prompt")

### For OpenAI
# from langchain.llms import OpenAI
# from langchain.chat_models import ChatOpenAI
# llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0.7)

### For Google Generative AI
GOOGLE_API_KEY = userdata.get('GOOGLE_API_KEY')

# Use Google's Generative AI chat model instead of OpenAI
llm = ChatGoogleGenerativeAI(model="gemini-1.5-flash", temperature=0.7, google_api_key=GOOGLE_API_KEY)

# LCEL chain: retrieve context, fill the prompt, call the model, parse the reply to a string
qe_lcel = (
    {
        "context": retriever,
        "question": RunnablePassthrough()
    }
    | prompt
    | llm
    | StrOutputParser()
)
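
To see what the chain above does step by step, the following plain-Python sketch mimics the same data flow without LangChain. Everything in it is a stand-in: `fake_retriever` and `fake_llm` are hypothetical placeholders, the prompt template is illustrative rather than the text of `rlm/rag-prompt`, and `format_docs` is an extra step (joining chunks into one string) that the chain above does not perform explicitly.

```python
def format_docs(chunks):
    """Join retrieved chunks into one context string."""
    return "\n\n".join(chunks)


def build_prompt(context, question):
    # Illustrative template, not the text of the hub prompt.
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


def fake_retriever(question):
    # Stand-in for a real retriever: always returns the same two chunks.
    return [
        "Communication is the exchange of thoughts, ideas, feelings, and messages.",
        "Every communication interaction serves a specific purpose.",
    ]


def fake_llm(prompt):
    # Stand-in for the chat model: reports the prompt size instead of answering.
    return f"(model reply to a {len(prompt)}-character prompt)"


def run_chain(question):
    context = format_docs(fake_retriever(question))  # step 1: retrieve context
    prompt = build_prompt(context, question)         # step 2: fill the prompt template
    reply = fake_llm(prompt)                         # step 3: call the model
    return reply.strip()                             # step 4: return plain text


print(run_chain("What do you mean by Communication?"))
```

The `|` operators in the LCEL chain wire these same four stages together, with the question passed through unchanged alongside the retrieved context.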

# 1.7 Output

Our RAG pipeline is ready. Let's query the document.

In [None]:
query = "What do you mean by Communication?"
# query = "What do you mean by python?"
response = qe_lcel.invoke(query)
print(response)

Communication is the exchange of thoughts, ideas, feelings, and messages between individuals.  It's crucial for collaboration, problem-solving, and achieving common goals.  Effective communication requires understanding its purpose and clarifying the message.


# **2. Try it Yourself**

Provide your own document and ask any question you like.

In [None]:
# Write your code here ...