# Build Your Own RAG System: From Theory to Implementation

Welcome to the hands-on demo where we'll:
- Initialize Pinecone as our vector database.
- Load a PDF document and split it into pages.
- Generate 768-dimension embeddings using a SentenceTransformer model.
- Upsert the embeddings into Pinecone.
- Query the index and retrieve relevant text chunks.
- Call a Gemini model endpoint (`gemini-2.0-flash`) to generate responses.
- Launch a simple, ChatGPT-like chat interface built with Streamlit.

**Prerequisites:**  
- Basic Python programming  
- Familiarity with AI concepts  
- Understanding of APIs & web services  
- Bring your laptop!


In [None]:
%pip install "pinecone[grpc]"
%pip install PyPDF2
%pip install sentence-transformers
%pip install streamlit
%pip install requests
%pip install -q -U google-genai


# Import all the necessary libraries

In [13]:
# Import the necessary libraries
import os
from pinecone.grpc import PineconeGRPC as Pinecone
from pinecone import ServerlessSpec
from sentence_transformers import SentenceTransformer
import requests
import subprocess
from PyPDF2 import PdfReader
import pprint


# Let's use the class API keys

In [19]:
PINECONE_API_KEY = '<add-api-key-here>'
GEMINI_API_KEY = '<add-api-key-here>'

# Create our Pinecone Client & Index

In [5]:
# Initialize the Pinecone client with your API key.
# Set PINECONE_API_KEY in the cell above before running this one.

pc = Pinecone(api_key=PINECONE_API_KEY)

# Create the index if it does not exist yet.
# Its dimension (768) must match the output size of our embedding model.
if not pc.has_index("sunway-demo"):
    pc.create_index(
        name="sunway-demo",
        dimension=768,
        metric="cosine",
        spec=ServerlessSpec(
            cloud="aws", 
            region="us-east-1"
        ) 
    ) 
    
# Wait until the index is ready before connecting to it.
import time

while not pc.describe_index("sunway-demo").status['ready']:
    time.sleep(1)

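The index above is created with `metric="cosine"`. As a minimal sketch of what that metric measures, the plain-Python function below computes the cosine similarity of two vectors. The name `cosine_similarity` is illustrative and not part of the Pinecone client; Pinecone computes the score server-side.

```python
import math

def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

print(cosine_similarity([1.0, 0.0], [2.0, 0.0]))  # same direction -> 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 3.0]))  # orthogonal -> 0.0
```

Because the score depends only on direction, two texts with similar meaning can match even when their embedding vectors differ in length.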
# Load the PDF using our PDF Reader

In [None]:
def load_pdf(filepath):
    """
    Load a PDF file and return a list where each element is the text of a page.
    """
    with open(filepath, 'rb') as file:
        reader = PdfReader(file)
        pages = [page.extract_text() for page in reader.pages]
    return pages

# Replace 'attention_sunway.pdf' with the path to your own PDF if you use a different document.
pdf_path = 'attention_sunway.pdf'
pdf_pages = load_pdf(pdf_path)
print(f"Loaded {len(pdf_pages)} pages from the PDF.")

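`load_pdf` returns one string per page, so each page becomes a single retrieval unit. If pages turn out to be too long or too mixed in topic, one option is to split them into smaller overlapping chunks first. The sketch below assumes a hypothetical helper `chunk_text`; the `chunk_size` and `overlap` defaults are illustrative values, not settings from this workshop.

```python
def chunk_text(text, chunk_size=200, overlap=40):
    """Split text into windows of `chunk_size` words that share `overlap` words."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks

sample = " ".join(f"w{i}" for i in range(10))
print(chunk_text(sample, chunk_size=4, overlap=1))
# ['w0 w1 w2 w3', 'w3 w4 w5 w6', 'w6 w7 w8 w9']
```

The overlap keeps a sentence that straddles a chunk boundary visible in both neighbouring chunks.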

# Generate the embeddings for the RAG model

In [None]:
# Load the SentenceTransformer model.
# We're using a model that produces 768-dimension embeddings. 
# 'all-mpnet-base-v2' is one such model; you can choose another if desired.
model = SentenceTransformer('all-mpnet-base-v2')

def generate_embedding(text):
    """
    Generate a 768-dimension embedding for the given text.
    """
    return model.encode(text).tolist()

# Generate embeddings for each page of the PDF.
embeddings = [generate_embedding(page) for page in pdf_pages]
print("Generated embeddings for all pages.")

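The cell above calls `model.encode` once per page. `SentenceTransformer.encode` also accepts a list of strings, so the pages can be encoded in one batched call. The sketch below wraps that call and checks that every vector has the 768 dimensions the index expects. `embed_pages` and `EXPECTED_DIM` are hypothetical names introduced here for illustration.

```python
EXPECTED_DIM = 768  # must match the `dimension` used when the index was created

def embed_pages(model, pages, batch_size=32):
    """Encode all pages in one batched `encode` call and validate their size."""
    raw = model.encode(pages, batch_size=batch_size)
    # Convert each row to a plain list of Python floats.
    vectors = [[float(x) for x in row] for row in raw]
    for i, vec in enumerate(vectors):
        if len(vec) != EXPECTED_DIM:
            raise ValueError(
                f"page {i}: got {len(vec)} dimensions, expected {EXPECTED_DIM}"
            )
    return vectors
```

In this notebook it could be used as `embeddings = embed_pages(model, pdf_pages)`. Catching a dimension mismatch here gives a clearer error than a rejected upsert later.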

# Prepare the embeddings for the Vector DB & Upsert

In [None]:
# Prepare the vectors for upserting. Each vector is assigned a unique ID and metadata.
vectors = [
    {
        'id': f'page_{i}',
        'values': embedding,
        'metadata': {'page_number': i, 'text': pdf_pages[i]}
    }
    for i, embedding in enumerate(embeddings)
]

index = pc.Index("sunway-demo")

# Upsert the vectors into the Pinecone index.
index.upsert(vectors=vectors)
print(f"Upserted {len(vectors)} vectors into the Pinecone index.")

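A single upsert is fine for a short paper, but Pinecone limits the size of one upsert request, so longer documents are usually sent in batches. The sketch below shows one way to slice the `vectors` list; the helper name `batched` and the batch size of 100 are assumptions for illustration.

```python
def batched(items, batch_size=100):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

# Demonstration with placeholder data instead of real vectors.
print([len(batch) for batch in batched(list(range(250)))])  # [100, 100, 50]

# In the notebook, the same helper would be used like this:
# for batch in batched(vectors):
#     index.upsert(vectors=batch)
```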

# Query our DB with an example query

In [None]:
def query_pinecone(query_text, top_k=5):
    """
    Query the Pinecone index for the top_k most similar text chunks.
    """
    query_embedding = generate_embedding(query_text)
    result = index.query(vector=query_embedding, top_k=top_k, include_metadata=True)
    return result

# Example: Query the index (this is just a test query)
test_query = "What is the transformer architecture?"
results = query_pinecone(test_query)
context = ""
# print(results)
print("Query Results:")
for match in results['matches']:
    print(f"Score: {match['score']}\nText: {match['metadata']['text']}\n")
    context += f"Score: {match['score']}\nText: {match['metadata']['text']}\n"

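The loop above adds every match to the context, however low its score. As a hedged sketch, the function below keeps only matches above a score threshold and stops once a character budget is reached. `build_context`, `min_score`, and `max_chars` are hypothetical names and values, and the demo uses plain dictionaries shaped like the `matches` entries accessed above.

```python
def build_context(matches, min_score=0.3, max_chars=6000):
    """Join the text of sufficiently relevant matches, most relevant first."""
    kept = [m for m in matches if m["score"] >= min_score]
    kept.sort(key=lambda m: m["score"], reverse=True)
    parts, used = [], 0
    for m in kept:
        text = m["metadata"]["text"]
        if used + len(text) > max_chars:
            break  # stay within the character budget
        parts.append(text)
        used += len(text)
    return "\n\n".join(parts)

demo = [
    {"score": 0.82, "metadata": {"text": "first passage"}},
    {"score": 0.12, "metadata": {"text": "weak match"}},
    {"score": 0.55, "metadata": {"text": "second passage"}},
]
print(build_context(demo))
```

A sensible threshold depends on the embedding model and the documents, so treat `0.3` as a starting point to tune rather than a recommended value.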

# Use the retrieved info + Gemini to answer our question

In [None]:
from google import genai

def generate_response(prompt, context):
    """
    Call the Gemini model to generate a response, using the context from Pinecone.
    """
    # GEMINI_API_KEY is set in the API-key cell above.

    client = genai.Client(api_key=GEMINI_API_KEY)
    response = client.models.generate_content(
        model="gemini-2.0-flash",
        contents=f"Context: {context}\n\nPrompt: {prompt}",
    )
    return response.text
    
# Test the generation function with the context retrieved from Pinecone above.
pprint.pprint(generate_response(test_query, context))

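`generate_response` sends the context and the question as one string. A slightly more explicit prompt can ask the model to stay within the retrieved context. The template below is only a sketch: `build_prompt` is a hypothetical helper and the instruction wording is an assumption, not a format Gemini requires.

```python
def build_prompt(question, context):
    """Assemble a prompt that asks the model to answer from the context only."""
    return (
        "Answer the question using only the context below. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context.strip()}\n\n"
        f"Question: {question.strip()}"
    )

print(build_prompt("What is the transformer architecture?", "retrieved text goes here"))
```

The returned string could be passed as `contents=` in the `generate_content` call in place of the inline f-string.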

# Finally, put this all in a Streamlit App

In [None]:
%%writefile chat_app.py

STREAMLIT_PINECONE_API_KEY = '<copy-paste-the-pinecone-api-key-here>'
STREAMLIT_GEMINI_API_KEY = '<copy-paste-the-gemini-api-key-here>'
STREAMLIT_INDEX_NAME = 'sunway-demo'

import streamlit as st
from pinecone.grpc import PineconeGRPC as Pinecone
from sentence_transformers import SentenceTransformer
from google import genai

# Set page configuration
st.set_page_config(page_title="RAG Chatbot Demo", layout="wide")

# Initialize Pinecone
pc = Pinecone(api_key=STREAMLIT_PINECONE_API_KEY)
index = pc.Index(STREAMLIT_INDEX_NAME)

# Initialize Google Genai client
client = genai.Client(api_key=STREAMLIT_GEMINI_API_KEY)

# Load the embedding model
@st.cache_resource
def load_model():
    return SentenceTransformer('all-mpnet-base-v2')

model = load_model()

def generate_embedding(text):
    """Generate embeddings for text using SentenceTransformer"""
    return model.encode(text).tolist()

def query_pinecone(query_text, top_k=5):
    """Query Pinecone index for similar documents"""
    query_embedding = generate_embedding(query_text)
    results = index.query(vector=query_embedding, top_k=top_k, include_metadata=True)
    return results

def generate_response(prompt, context):
    """Generate response using Google's Gemini model"""
    response = client.models.generate_content(
        model="gemini-2.0-flash",
        contents=f"Context: {context}\n\nPrompt: {prompt}",
    )
    return response.text

# Initialize session state for chat history
if 'messages' not in st.session_state:
    st.session_state.messages = []

# Display chat header
st.title("📚 RAG Chatbot Demo")
st.markdown("Ask questions about the Transformer paper!")

# Display chat history
for message in st.session_state.messages:
    with st.chat_message(message["role"]):
        st.markdown(message["content"])

# Get user input
if prompt := st.chat_input("What would you like to know about the Transformer architecture?"):
    # Add user message to chat history
    st.session_state.messages.append({"role": "user", "content": prompt})
    
    # Display user message
    with st.chat_message("user"):
        st.markdown(prompt)
    
    # Display assistant response
    with st.chat_message("assistant"):
        message_placeholder = st.empty()
        
        # Show a spinner while processing
        with st.spinner("Thinking..."):
            # Retrieve similar chunks from Pinecone
            results = query_pinecone(prompt)
            
            # Format the context from retrieved documents
            context = ""
            for match in results['matches']:
                context += f"{match['metadata']['text']}\n\n"
            
            # Generate a response using the Gemini model
            response_text = generate_response(prompt, context)
            
            # Display the response
            message_placeholder.markdown(response_text)
    
    # Add assistant response to chat history
    st.session_state.messages.append({"role": "assistant", "content": response_text})

# Add info about the system
with st.sidebar:
    st.title("About")
    st.markdown("""
    This is a RAG (Retrieval-Augmented Generation) chatbot demo that:
    
    1. Takes your question
    2. Finds relevant passages from the Transformer paper
    3. Uses Google's Gemini model to generate a response based on the retrieved context
    
    The system uses:
    - Pinecone for vector storage
    - SentenceTransformer for embeddings
    - Google Gemini for text generation
    """)

In [None]:
# Launch the Streamlit app in a new process.
# This will open the chat interface in your default web browser.
import subprocess
subprocess.Popen(["streamlit", "run", "chat_app.py"])
print("Streamlit chat interface launched.")


## Summary

In this notebook, we have:
- Initialized Pinecone and created a vector index.
- Loaded and processed a PDF document (Attention Is All You Need) into text pages.
- Generated 768-dimensional embeddings for each page using a SentenceTransformer.
- Uploaded these embeddings to Pinecone.
- Defined functions to query the index and call a Gemini model endpoint.
- Created and launched a simple Streamlit chat interface to interact with our RAG system.

**Next Steps:**
- Replace the placeholder API keys and the index name with your own values.
- Test and fine-tune the system with your specific PDF documents and queries.
- Explore further enhancements to the chat interface and document processing!

Happy coding and enjoy the workshop!
