# Build Your Own RAG System: From Theory to Implementation

Welcome to the hands-on demo where we'll:
- Initialize Pinecone as our vector database.
- Load a PDF document and split it into pages.
- Generate 768-dimension embeddings using a SentenceTransformer model.
- Upsert the embeddings into Pinecone.
- Query the index and retrieve relevant text chunks.
- Call a Gemini model endpoint (`gemini-2.0-flash`) to generate responses.
- Launch a simple Streamlit chat interface for asking questions, ChatGPT-style.
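
The steps above can be sketched as one small function before any real service is wired in. This is only an illustration: `rag_answer` and the three `fake_*` helpers are hypothetical names, standing in for the embedding model, the Pinecone query, and the Gemini call built later in the notebook.

```python
from typing import Callable, List, Sequence


def rag_answer(
    question: str,
    embed: Callable[[str], Sequence[float]],
    search: Callable[[Sequence[float], int], List[str]],
    generate: Callable[[str], str],
    top_k: int = 5,
) -> str:
    """Embed the question, retrieve top_k passages, and ask the LLM to answer."""
    query_vector = embed(question)           # 1. question -> vector
    passages = search(query_vector, top_k)   # 2. vector -> similar passages
    context = "\n\n".join(passages)          # 3. passages -> one context string
    prompt = f"Context: {context}\n\nPrompt: {question}"
    return generate(prompt)                  # 4. context + question -> answer


# Stand-in components (assumptions) so the sketch runs without external services.
def fake_embed(text: str) -> List[float]:
    return [float(len(text))]


def fake_search(vector: Sequence[float], top_k: int) -> List[str]:
    corpus = ["Passage A", "Passage B", "Passage C"]
    return corpus[:top_k]


def fake_generate(prompt: str) -> str:
    return f"(answer based on {prompt.count('Passage')} passages)"


print(rag_answer("What is attention?", fake_embed, fake_search, fake_generate, top_k=2))
# prints: (answer based on 2 passages)
```

Passing the components in as functions keeps the flow readable and makes it easy to swap a stub for the real model or index.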

**Prerequisites:**  
- Basic Python programming  
- Familiarity with AI concepts  
- Understanding of APIs & web services  
- Bring your laptop!


In [1]:
%pip install "pinecone[grpc]"
%pip install PyPDF2
%pip install sentence-transformers
%pip install streamlit
%pip install requests
%pip install -q -U google-genai



# Import all the necessary libraries

In [2]:
# Import the necessary libraries
import os
from pinecone.grpc import PineconeGRPC as Pinecone
from pinecone import ServerlessSpec
from sentence_transformers import SentenceTransformer
import requests
import subprocess
from PyPDF2 import PdfReader
import pprint




# Let's use the class API keys

In [3]:
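# Read the keys from environment variables so they never appear in the notebook.
# Set PINECONE_API_KEY and GEMINI_API_KEY in your shell before starting Jupyter.
PINECONE_API_KEY = os.environ["PINECONE_API_KEY"]
GEMINI_API_KEY = os.environ["GEMINI_API_KEY"]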

# Create our Pinecone Client & Index

In [4]:
# Initialize the Pinecone client with the API key defined above.

pc = Pinecone(api_key=PINECONE_API_KEY)

# Create the index if it does not already exist.
# Its dimension (768) must match the output size of our embedding model.
if not pc.has_index("sunway-demo"):
    pc.create_index(
        name="sunway-demo",
        dimension=768,
        metric="cosine",
        spec=ServerlessSpec(
            cloud="aws", 
            region="us-east-1"
        ) 
    ) 
    
# Wait until the index is ready before connecting to it.
import time
while not pc.describe_index("sunway-demo").status['ready']:
    time.sleep(1)
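
To make `metric="cosine"` concrete, here is a minimal pure-Python sketch of cosine similarity. The helper name `cosine_similarity` is introduced for illustration only; Pinecone computes the metric server-side, and this is not its implementation.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


print(cosine_similarity([1.0, 0.0], [2.0, 0.0]))   # same direction -> 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 3.0]))   # orthogonal -> 0.0
print(cosine_similarity([1.0, 0.0], [-1.0, 0.0]))  # opposite direction -> -1.0
```

Because only the angle matters, a vector and a scaled copy of it score 1.0, which is why cosine is a common choice for text embeddings.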

# Load the PDF using our PDF Reader

In [5]:
def load_pdf(filepath):
    """
    Load a PDF file and return a list where each element is the text of a page.
    """
    with open(filepath, 'rb') as file:
        reader = PdfReader(file)
        # Fall back to an empty string for pages with no extractable text.
        pages = [page.extract_text() or "" for page in reader.pages]
    return pages

# Path to the workshop PDF; replace it with the path to your own document.
pdf_path = 'attention_sunway.pdf'
pdf_pages = load_pdf(pdf_path)
print(f"Loaded {len(pdf_pages)} pages from the PDF.")


Loaded 15 pages from the PDF.
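
This notebook treats each page as one retrieval unit. If pages turn out to be too coarse, one common alternative is to split the text into smaller overlapping chunks. The sketch below is an assumption, not part of the workshop code: `chunk_text` and its word-based sizes are hypothetical, and the defaults are arbitrary starting points.

```python
from typing import List


def chunk_text(text: str, chunk_size: int = 200, overlap: int = 40) -> List[str]:
    """Split text into windows of chunk_size words that share `overlap` words."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


sample = " ".join(f"w{i}" for i in range(10))
print(chunk_text(sample, chunk_size=4, overlap=1))
# prints: ['w0 w1 w2 w3', 'w3 w4 w5 w6', 'w6 w7 w8 w9']
```

The overlap keeps a sentence that straddles a boundary visible in both neighbouring chunks, at the cost of storing some text twice.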


# Generate the embeddings for the RAG model

In [6]:
# Load the SentenceTransformer model.
# We're using a model that produces 768-dimension embeddings. 
# 'all-mpnet-base-v2' is one such model; you can choose another if desired.
model = SentenceTransformer('all-mpnet-base-v2')

def generate_embedding(text):
    """
    Generate a 768-dimension embedding for the given text.
    """
    return model.encode(text).tolist()

# Generate embeddings for each page of the PDF.
embeddings = [generate_embedding(page) for page in pdf_pages]
print("Generated embeddings for all pages.")




Generated embeddings for all pages.
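
The index was created with `dimension=768`, so every vector we send must have exactly 768 values. A quick sanity check before upserting can catch a mismatch early, for example after swapping in a smaller model such as `all-MiniLM-L6-v2`, which produces 384-dimension vectors. `check_dimensions` is a hypothetical helper, and the toy vectors below stand in for real model output.

```python
from typing import List, Sequence


def check_dimensions(
    embeddings: Sequence[Sequence[float]], expected_dim: int = 768
) -> List[int]:
    """Return the positions of embeddings whose length differs from expected_dim."""
    return [i for i, vec in enumerate(embeddings) if len(vec) != expected_dim]


# Toy vectors (assumption): two of the right size, one of the wrong size.
toy = [[0.0] * 768, [0.0] * 384, [0.0] * 768]
print(check_dimensions(toy))
# prints: [1]
```

In the notebook, `check_dimensions(embeddings)` should return an empty list before the upsert step runs.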


# Prepare the embeddings for the Vector DB & Upsert

In [7]:
# Prepare the vectors for upserting. Each vector is assigned a unique ID and metadata.
vectors = [
    {
        'id': f'page_{i}',
        'values': embedding,
        'metadata': {'page_number': i, 'text': pdf_pages[i]}
    }
    for i, embedding in enumerate(embeddings)
]

index = pc.Index("sunway-demo")

# Upsert the vectors into the Pinecone index.
index.upsert(vectors=vectors)
print(f"Upserted {len(vectors)} vectors into the Pinecone index.")


Upserted 15 vectors into the Pinecone index.
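
Fifteen vectors fit comfortably in one request, but a larger document may need to be sent in batches. The sketch below is an assumption about how that could look: `batched` and `upsert_in_batches` are hypothetical helpers, the batch size of 100 is an arbitrary default, and a plain list stands in for the Pinecone index so the example runs offline.

```python
from typing import Callable, Iterator, List, Sequence


def batched(items: Sequence[dict], batch_size: int) -> Iterator[List[dict]]:
    """Yield consecutive slices of items with at most batch_size elements."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


def upsert_in_batches(
    upsert: Callable[[List[dict]], object],
    vectors: Sequence[dict],
    batch_size: int = 100,
) -> int:
    """Send vectors through `upsert` one batch at a time; return the count sent."""
    sent = 0
    for batch in batched(vectors, batch_size):
        upsert(batch)
        sent += len(batch)
    return sent


# Stand-in for the real index (assumption): collect batches in a list.
received: List[List[dict]] = []
toy_vectors = [{"id": f"page_{i}", "values": [0.0]} for i in range(5)]
print(upsert_in_batches(received.append, toy_vectors, batch_size=2))  # prints: 5
print([len(batch) for batch in received])                             # prints: [2, 2, 1]
```

With the real index, the call would look like `upsert_in_batches(lambda batch: index.upsert(vectors=batch), vectors)`.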


# Query our DB with an example query

In [8]:
def query_pinecone(query_text, top_k=5):
    """
    Query the Pinecone index for the top_k most similar text chunks.
    """
    query_embedding = generate_embedding(query_text)
    result = index.query(vector=query_embedding, top_k=top_k, include_metadata=True)
    return result

# Example: Query the index (this is just a test query)
test_query = "What is the transformer architecture?"
results = query_pinecone(test_query)
context = ""
# print(results)
print("Query Results:")
for match in results['matches']:
    print(f"Score: {match['score']}\nText: {match['metadata']['text']}\n")
    context += f"Score: {match['score']}\nText: {match['metadata']['text']}\n"
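
The loop above passes every match to the model, including weak ones. One possible refinement is to keep only matches above a similarity threshold. This is a sketch under assumptions: `build_context` is a hypothetical helper, the `0.3` cutoff is arbitrary and should be tuned on your own data, and the toy matches only mimic the `score` and `metadata` fields used above.

```python
from typing import Iterable, Mapping


def build_context(matches: Iterable[Mapping], min_score: float = 0.3) -> str:
    """Join the text of matches scoring at least min_score, best first."""
    kept = [m for m in matches if m["score"] >= min_score]
    kept.sort(key=lambda m: m["score"], reverse=True)
    return "\n\n".join(m["metadata"]["text"] for m in kept)


# Toy matches (assumption) shaped like the entries in results['matches'].
toy_matches = [
    {"score": 0.62, "metadata": {"text": "Attention maps a query to an output."}},
    {"score": 0.12, "metadata": {"text": "References and acknowledgements."}},
    {"score": 0.48, "metadata": {"text": "The encoder has six layers."}},
]
print(build_context(toy_matches))
```

Here the low-scoring references page is dropped and the two remaining passages are joined in descending score order.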



# Use the retrieved info + Gemini to answer our question

In [9]:
from google import genai

def generate_response(prompt, context):
    """
    Call the Gemini model to generate a response, using the context from Pinecone.
    """
    # Create a client using the GEMINI_API_KEY defined earlier.
    client = genai.Client(api_key=GEMINI_API_KEY)
    response = client.models.generate_content(
        model="gemini-2.0-flash",
        contents=f"Context: {context}\n\nPrompt: {prompt}"
    )
    return response.text
    
# Test the generation function with the context retrieved from Pinecone above.
pprint.pprint(generate_response(test_query, context))


('The Transformer architecture is a sequence transduction model based entirely '
 'on attention mechanisms. It replaces the recurrent layers commonly used in '
 'encoder-decoder architectures with multi-headed self-attention. The model '
 'consists of an encoder and a decoder, each composed of stacked layers. The '
 'encoder and decoder stacks both have N=6 identical layers. Each layer in the '
 'encoder has two sub-layers: a multi-head self-attention mechanism and a '
 'position-wise fully connected feed-forward network. The decoder has the same '
 'sub-layers as the encoder, but also includes a third sub-layer that performs '
 'multi-head attention over the output of the encoder stack. The architecture '
 'uses residual connections around each of the sub-layers, followed by layer '
 'normalization. It also employs learned embeddings to convert input and '
 'output tokens to vectors.\n')
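
The prompt sent to Gemini above is a plain "Context / Prompt" string. A slightly more explicit template can tell the model to stay within the retrieved text. The following is a sketch, not the workshop's method: `build_prompt`, its wording, and the 12,000-character cap are all assumptions chosen for illustration.

```python
def build_prompt(question: str, context: str, max_context_chars: int = 12000) -> str:
    """Assemble a grounded prompt, truncating the context to max_context_chars."""
    trimmed = context[:max_context_chars]
    return (
        "Answer the question using only the context below. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{trimmed}\n\n"
        f"Question: {question}"
    )


print(build_prompt("What is the transformer architecture?", "The Transformer uses attention."))
```

The resulting string can be passed as `contents` to `client.models.generate_content` in place of the inline f-string.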


# Finally, put this all in a Streamlit App

In [10]:
%%writefile chat_app.py

STREAMLIT_PINECONE_API_KEY = '<copy-paste-the-pinecone-api-key-here>'
STREAMLIT_GEMINI_API_KEY = '<copy-paste-the-gemini-api-key-here>'
STREAMLIT_INDEX_NAME = 'sunway-demo'

import streamlit as st
from pinecone.grpc import PineconeGRPC as Pinecone
from sentence_transformers import SentenceTransformer
from google import genai

# Set page configuration
st.set_page_config(page_title="RAG Chatbot Demo", layout="wide")

# Initialize Pinecone
pc = Pinecone(api_key=STREAMLIT_PINECONE_API_KEY)
index = pc.Index(STREAMLIT_INDEX_NAME)

# Initialize Google Genai client
client = genai.Client(api_key=STREAMLIT_GEMINI_API_KEY)

# Load the embedding model
@st.cache_resource
def load_model():
    return SentenceTransformer('all-mpnet-base-v2')

model = load_model()

def generate_embedding(text):
    """Generate embeddings for text using SentenceTransformer"""
    return model.encode(text).tolist()

def query_pinecone(query_text, top_k=5):
    """Query Pinecone index for similar documents"""
    query_embedding = generate_embedding(query_text)
    results = index.query(vector=query_embedding, top_k=top_k, include_metadata=True)
    return results

def generate_response(prompt, context):
    """Generate response using Google's Gemini model"""
    response = client.models.generate_content(
        model="gemini-2.0-flash",
        contents=f"Context: {context}\n\nPrompt: {prompt}"
    )
    return response.text

# Initialize session state for chat history
if 'messages' not in st.session_state:
    st.session_state.messages = []

# Display chat header
st.title("📚 RAG Chatbot Demo")
st.markdown("Ask questions about the Transformer paper!")

# Display chat history
for message in st.session_state.messages:
    with st.chat_message(message["role"]):
        st.markdown(message["content"])

# Get user input
if prompt := st.chat_input("What would you like to know about the Transformer architecture?"):
    # Add user message to chat history
    st.session_state.messages.append({"role": "user", "content": prompt})
    
    # Display user message
    with st.chat_message("user"):
        st.markdown(prompt)
    
    # Display assistant response
    with st.chat_message("assistant"):
        message_placeholder = st.empty()
        
        # Show a spinner while processing
        with st.spinner("Thinking..."):
            # Retrieve similar chunks from Pinecone
            results = query_pinecone(prompt)
            
            # Format the context from retrieved documents
            context = ""
            for match in results['matches']:
                context += f"{match['metadata']['text']}\n\n"
            
            # Generate a response using the Gemini model
            response_text = generate_response(prompt, context)
            
            # Display the response
            message_placeholder.markdown(response_text)
    
    # Add assistant response to chat history
    st.session_state.messages.append({"role": "assistant", "content": response_text})

# Add info about the system
with st.sidebar:
    st.title("About")
    st.markdown("""
    This is a RAG (Retrieval-Augmented Generation) chatbot demo that:
    
    1. Takes your question
    2. Finds relevant passages from the Transformer paper
    3. Uses Google's Gemini model to generate a response based on the retrieved context
    
    The system uses:
    - Pinecone for vector storage
    - SentenceTransformer for embeddings
    - Google Gemini for text generation
    """)

Overwriting chat_app.py


In [11]:
# Launch the Streamlit app in a new process.
# This will open the chat interface in your default web browser.
import subprocess
subprocess.Popen(["streamlit", "run", "chat_app.py"])
print("Streamlit chat interface launched.")


Streamlit chat interface launched.
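
If the `streamlit` executable is not on your `PATH` (common when Jupyter runs inside a virtual environment), launching through the current interpreter is one workaround. This is a sketch under assumptions: `streamlit_command` and `launch` are hypothetical helpers, and 8501 is Streamlit's default port.

```python
import subprocess
import sys
from typing import List


def streamlit_command(script: str, port: int = 8501) -> List[str]:
    """Build the command that runs a Streamlit script with this interpreter."""
    return [sys.executable, "-m", "streamlit", "run", script, "--server.port", str(port)]


def launch(script: str, port: int = 8501) -> subprocess.Popen:
    """Start the app in a background process and return its handle."""
    return subprocess.Popen(streamlit_command(script, port))


print(streamlit_command("chat_app.py")[1:])
# prints: ['-m', 'streamlit', 'run', 'chat_app.py', '--server.port', '8501']
```

Keeping the handle, as in `proc = launch("chat_app.py")`, lets you stop the app later with `proc.terminate()`.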


## Summary

In this notebook, we have:
- Initialized Pinecone and created a vector index.
- Loaded and processed a PDF document (Attention Is All You Need) into text pages.
- Generated 768-dimensional embeddings for each page using a SentenceTransformer.
- Uploaded these embeddings to Pinecone.
- Defined functions to query the index and call a Gemini model endpoint.
- Created and launched a simple Streamlit chat interface to interact with our RAG system.

**Next Steps:**
- Replace the API keys and index name with your own.
- Test and fine-tune the system with your specific PDF documents and queries.
- Explore further enhancements to the chat interface and document processing!

Happy coding and enjoy the workshop!
