# RAG Introduction with Direct OpenAI and Pinecone APIs

This notebook demonstrates the complete RAG pipeline using **direct API calls** to OpenAI and Pinecone (without LangChain): document loading, chunking, embeddings, vector database setup, and retrieval-augmented generation.

## Setup and Installation

# Install required packages
!pip install openai pinecone pypdf python-dotenv -q

In [1]:
# Import necessary libraries
import os
from dotenv import load_dotenv

# Load environment variables from .env file
load_dotenv()

# Verify API keys are loaded
if not os.getenv("OPENAI_API_KEY"):
    raise ValueError("OPENAI_API_KEY not found in environment variables. Please create a .env file with your API key.")
if not os.getenv("PINECONE_API_KEY"):
    raise ValueError("PINECONE_API_KEY not found in environment variables. Please create a .env file with your API key.")

## Section 1: Document Loading and Chunking

In [2]:
# For demonstration, create sample documents
# In production, you would load from PDF using pypdf:
# from pypdf import PdfReader
# reader = PdfReader("path/to/your/document.pdf")
# text = "\n".join(page.extract_text() or "" for page in reader.pages)

documents = [
    {"page_content": "Machine learning is a subset of artificial intelligence that enables systems to learn from data."},
    {"page_content": "Deep learning uses neural networks with multiple layers to process complex patterns."},
    {"page_content": "Natural language processing allows computers to understand and generate human language."}
]

In [3]:
# Custom chunking function (replaces RecursiveCharacterTextSplitter)
def chunk_text(text, chunk_size=500, chunk_overlap=50):
    """
    Split text into chunks with overlap.
    
    Args:
        text: The text to chunk
        chunk_size: Maximum size of each chunk
        chunk_overlap: Number of characters to overlap between chunks
    
    Returns:
        List of chunk dictionaries with 'page_content' and 'metadata'
    """
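    # Guard against a window that never advances (would loop forever)
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    
    chunks = []
    start = 0
    
    while start < len(text):
        # Clamp the end so the metadata reflects the real chunk boundary
        end = min(start + chunk_size, len(text))
        chunk = text[start:end]
        
        chunks.append({
            "page_content": chunk,
            "metadata": {"start": start, "end": end}
        })
        
        # Stop once the window reaches the end of the text; otherwise the
        # next chunk would contain only the overlap
        if end == len(text):
            break
        
        # Move start position forward, accounting for overlap
        start += chunk_size - chunk_overlap
    
    return chunks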

def chunk_documents(documents, chunk_size=500, chunk_overlap=50):
    """
    Chunk a list of documents.
    
    Args:
        documents: List of document dictionaries with 'page_content'
        chunk_size: Maximum size of each chunk
        chunk_overlap: Number of characters to overlap between chunks
    
    Returns:
        List of chunk dictionaries
    """
    all_chunks = []
    for i, doc in enumerate(documents):
        text = doc["page_content"]
        chunks = chunk_text(text, chunk_size, chunk_overlap)
        # Add document index to metadata
        for chunk in chunks:
            chunk["metadata"]["document_index"] = i
        all_chunks.extend(chunks)
    return all_chunks

# Chunk the documents
chunks = chunk_documents(documents, chunk_size=500, chunk_overlap=50)
print(f"Number of chunks: {len(chunks)}")
print(f"First chunk: {chunks[0]['page_content'][:100]}...")

Number of chunks: 3
First chunk: Machine learning is a subset of artificial intelligence that enables systems to learn from data....


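**Scoping Insight**: Chunking matters when documents are large or structured. Here, every sample document is shorter than `chunk_size=500`, so the chunker returns each one unchanged (3 documents in, 3 chunks out). For simple use cases with small documents, you might skip chunking entirely. Recognize when chunking adds value and when it is unnecessary complexity.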
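
One way to judge whether chunking is worth it is to estimate how many chunks a document would produce before splitting anything. The sketch below assumes a character-based sliding window that stops once a window reaches the end of the text; `estimate_chunk_count` is a hypothetical helper, not part of any library.

```python
import math


def estimate_chunk_count(text_length, chunk_size=500, chunk_overlap=50):
    """Estimate how many sliding-window chunks a text of text_length characters yields.

    Assumes the window stops as soon as it reaches the end of the text.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    if text_length <= 0:
        return 0
    if text_length <= chunk_size:
        return 1  # the whole text fits in one chunk, so chunking is a no-op
    step = chunk_size - chunk_overlap
    # One full first window, plus one window per additional step (rounded up)
    return 1 + math.ceil((text_length - chunk_size) / step)


for length in (95, 500, 5_000, 50_000):
    print(f"{length:>6} characters: {estimate_chunk_count(length)} chunk(s)")
```

If the estimate is 1 for every document in a collection, the chunking step can be dropped without changing what gets embedded.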

## Section 2: Embeddings

In [4]:
from openai import OpenAI

# Initialize OpenAI client
openai_client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# Embedding functions (replaces OpenAIEmbeddings)
def embed_query(text, client, model="text-embedding-3-small"):
    """
    Create an embedding for a single query text.
    
    Args:
        text: The text to embed
        client: OpenAI client instance
        model: Embedding model to use
    
    Returns:
        List of embedding values
    """
    response = client.embeddings.create(
        model=model,
        input=text
    )
    return response.data[0].embedding

def embed_documents(texts, client, model="text-embedding-3-small"):
    """
    Create embeddings for multiple texts (batch processing).
    
    Args:
        texts: List of texts to embed
        client: OpenAI client instance
        model: Embedding model to use
    
    Returns:
        List of embedding vectors
    """
    response = client.embeddings.create(
        model=model,
        input=texts
    )
    return [item.embedding for item in response.data]

# Create embeddings for a sample text
sample_text = "Machine learning enables systems to learn from data"
sample_embedding = embed_query(sample_text, openai_client)
print(f"Embedding dimension: {len(sample_embedding)}")
print(f"First 5 values: {sample_embedding[:5]}")

Embedding dimension: 1536
First 5 values: [-0.01686064898967743, -0.0005186386406421661, 0.020080123096704483, -0.015040520578622818, 0.07785451412200928]


**Scoping Insight**: Embedding costs add up with large document collections. Consider cheaper embedding models for MVPs, and upgrade only when quality matters. Understand the cost implications before committing to a solution.
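
A back-of-envelope estimate can make the cost concrete before embedding a whole collection. The sketch below assumes roughly four characters per token (a common rule of thumb for English text, not an exact count) and takes the price as a parameter. The names `estimate_embedding_cost`, `price_per_million_tokens`, and `chars_per_token` are hypothetical, and the price in the usage lines is a placeholder to be replaced with the current figure from the provider's pricing page.

```python
def estimate_embedding_cost(texts, price_per_million_tokens, chars_per_token=4.0):
    """Roughly estimate the cost of embedding a list of texts.

    chars_per_token is a heuristic; use a tokenizer when exact counts matter.
    """
    total_chars = sum(len(text) for text in texts)
    estimated_tokens = total_chars / chars_per_token
    return estimated_tokens / 1_000_000 * price_per_million_tokens


# Placeholder price, for illustration only
sample_texts = ["Machine learning enables systems to learn from data"] * 10_000
cost = estimate_embedding_cost(sample_texts, price_per_million_tokens=0.02)
print(f"Estimated cost: ${cost:.4f}")
```

Running the same estimate with the prices of two candidate models gives a quick sense of whether the difference matters at your collection size.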

## Section 3: Pinecone Vector Store Setup

In [5]:
import time

from pinecone import Pinecone, ServerlessSpec

# Initialize Pinecone client
pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])

# Create or connect to an index
index_name = "rag-openai-index"

# Check if index exists, create if not
existing_indexes = [idx.name for idx in pc.list_indexes()]
if index_name not in existing_indexes:
    pc.create_index(
        name=index_name,
        dimension=1536,  # Must match the embedding size (text-embedding-3-small returns 1536)
        metric="cosine",
        spec=ServerlessSpec(cloud="aws", region="us-east-1")
    )
    # Wait until the new index is ready before upserting into it
    while not pc.describe_index(index_name).status["ready"]:
        time.sleep(1)
    print(f"Created index: {index_name}")
else:
    print(f"Index {index_name} already exists")


In [None]:
# Connect to the index
index = pc.Index(index_name)

# Create embeddings for all chunks
chunk_texts = [chunk["page_content"] for chunk in chunks]
chunk_embeddings = embed_documents(chunk_texts, openai_client)

# Prepare vectors for upsert (replaces PineconeVectorStore.from_documents)
vectors_to_upsert = []
for i, (chunk, embedding) in enumerate(zip(chunks, chunk_embeddings)):
    vectors_to_upsert.append({
        "id": f"chunk-{i}",
        "values": embedding,
        "metadata": {
            "text": chunk["page_content"],
            **chunk["metadata"]
        }
    })

# Upsert vectors to Pinecone
index.upsert(vectors=vectors_to_upsert)
print(f"Added {len(vectors_to_upsert)} documents to Pinecone vector store")

Added 3 documents to Pinecone vector store
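
A single upsert call is fine for three vectors. For larger collections, requests are usually sent in batches so that no single request grows too large. The sketch below is a generic batching helper; `iter_batches` is a hypothetical name, and `batch_size=100` is an assumed starting value, not a documented limit.

```python
def iter_batches(items, batch_size=100):
    """Yield consecutive slices of items with at most batch_size elements each."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Hypothetical usage with the notebook's objects (not executed here):
# for batch in iter_batches(vectors_to_upsert, batch_size=100):
#     index.upsert(vectors=batch)

print([len(batch) for batch in iter_batches(list(range(250)), batch_size=100)])
```

The same helper can wrap the embedding call, so that texts are embedded and upserted one batch at a time instead of all at once.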


**Scoping Insight**: Pinecone is powerful but adds infrastructure complexity and cost. For small projects or MVPs, consider simpler alternatives like in-memory vector stores or Chroma. Use Pinecone when you need scale, performance, or managed infrastructure.
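
To make the in-memory alternative concrete, here is a minimal sketch of a vector store that keeps everything in a Python dictionary and ranks by cosine similarity. `InMemoryVectorStore` and `cosine_similarity` are hypothetical names. The `upsert` input uses the same dictionary shape as the vectors prepared for Pinecone, but `query` returns plain dictionaries rather than Pinecone match objects, and the brute-force scan only suits small collections.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


class InMemoryVectorStore:
    """Brute-force vector store kept entirely in memory."""

    def __init__(self):
        self._vectors = {}  # id -> (values, metadata)

    def upsert(self, vectors):
        # Insert new ids and overwrite existing ones
        for vector in vectors:
            self._vectors[vector["id"]] = (vector["values"], vector.get("metadata", {}))

    def query(self, vector, top_k=2):
        # Score every stored vector, then keep the top_k most similar
        scored = [
            {"id": vec_id, "score": cosine_similarity(vector, values), "metadata": metadata}
            for vec_id, (values, metadata) in self._vectors.items()
        ]
        scored.sort(key=lambda match: match["score"], reverse=True)
        return scored[:top_k]


# Toy two-dimensional vectors stand in for real embeddings
store = InMemoryVectorStore()
store.upsert([
    {"id": "a", "values": [1.0, 0.0], "metadata": {"text": "first"}},
    {"id": "b", "values": [0.0, 1.0], "metadata": {"text": "second"}},
    {"id": "c", "values": [1.0, 1.0], "metadata": {"text": "third"}},
])
for match in store.query([1.0, 0.1], top_k=2):
    print(match["id"], round(match["score"], 3))
```

Everything is lost when the process exits, which is usually acceptable for a prototype and is the main reason to move to a persistent or managed store later.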

## Section 4: Query and Retrieval

In [None]:
# Perform a similarity search (replaces vectorstore.similarity_search)
def similarity_search(query, index, openai_client, k=2):
    """
    Search for similar documents using query embedding.
    
    Args:
        query: Query text
        index: Pinecone index instance
        openai_client: OpenAI client instance
        k: Number of results to return
    
    Returns:
        List of document dictionaries with 'page_content' and 'metadata'
    """
    # Embed the query
    query_embedding = embed_query(query, openai_client)
    
    # Query Pinecone
    results = index.query(
        vector=query_embedding,
        top_k=k,
        include_metadata=True
    )
    
    # Format results
    documents = []
    for match in results.matches:
        documents.append({
            "page_content": match.metadata["text"],
            "metadata": {key: value for key, value in match.metadata.items() if key != "text"}
        })
    
    return documents

# Perform a search
query = "What is machine learning?"
results = similarity_search(query, index, openai_client, k=2)

print(f"Query: {query}")
print(f"\nRetrieved {len(results)} documents:")
for i, doc in enumerate(results, 1):
    print(f"\n{i}. {doc['page_content']}")

Query: What is machine learning?

Retrieved 2 documents:

1. Machine learning is a subset of artificial intelligence that enables systems to learn from data.

2. Deep learning uses neural networks with multiple layers to process complex patterns.


In [None]:
# Get similarity scores (replaces vectorstore.similarity_search_with_score)
def similarity_search_with_score(query, index, openai_client, k=2):
    """
    Search for similar documents with similarity scores.
    
    Args:
        query: Query text
        index: Pinecone index instance
        openai_client: OpenAI client instance
        k: Number of results to return
    
    Returns:
        List of tuples (document_dict, score)
    """
    # Embed the query
    query_embedding = embed_query(query, openai_client)
    
    # Query Pinecone
    results = index.query(
        vector=query_embedding,
        top_k=k,
        include_metadata=True
    )
    
    # Format results with scores
    documents_with_scores = []
    for match in results.matches:
        doc = {
            "page_content": match.metadata["text"],
            "metadata": {key: value for key, value in match.metadata.items() if key != "text"}
        }
        documents_with_scores.append((doc, match.score))
    
    return documents_with_scores

# Get results with scores
results_with_scores = similarity_search_with_score(query, index, openai_client, k=2)

print(f"Query: {query}")
print(f"\nRetrieved documents with scores:")
for doc, score in results_with_scores:
    print(f"\nScore: {score:.4f}")
    print(f"Content: {doc['page_content']}")


**Scoping Insight**: Retrieval quality varies with chunking strategy and embedding model. Test retrieval before building the full RAG system. If retrieval consistently fails, the problem might be with chunking or embeddings, not the LLM.
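
Testing retrieval on its own can be as simple as a handful of queries paired with a snippet that the right document should contain. The sketch below computes a hit rate at k for any retrieval function. `hit_rate_at_k` and `keyword_retrieve` are hypothetical names, and the keyword-overlap retriever is only a stand-in so the example runs without API keys. In the notebook you would pass a small wrapper around the similarity search instead.

```python
def hit_rate_at_k(test_cases, retrieve, k=2):
    """Fraction of test cases whose expected snippet appears in the top-k results.

    test_cases: list of (query, expected_snippet) pairs
    retrieve:   callable (query, k) -> list of dicts with 'page_content'
    """
    if not test_cases:
        return 0.0
    hits = 0
    for query, expected_snippet in test_cases:
        retrieved = retrieve(query, k)
        if any(expected_snippet in doc["page_content"] for doc in retrieved):
            hits += 1
    return hits / len(test_cases)


# Stand-in corpus and retriever (assumption: replace with the real search)
corpus = [
    {"page_content": "Machine learning is a subset of artificial intelligence that enables systems to learn from data."},
    {"page_content": "Deep learning uses neural networks with multiple layers to process complex patterns."},
    {"page_content": "Natural language processing allows computers to understand and generate human language."},
]


def _words(text):
    return set(text.lower().replace("?", "").replace(".", "").split())


def keyword_retrieve(query, k):
    # Rank documents by the number of words they share with the query
    ranked = sorted(
        corpus,
        key=lambda doc: len(_words(query) & _words(doc["page_content"])),
        reverse=True,
    )
    return ranked[:k]


cases = [
    ("What is machine learning?", "subset of artificial intelligence"),
    ("How do computers understand language?", "Natural language processing"),
]
print(f"Hit rate at k=1: {hit_rate_at_k(cases, keyword_retrieve, k=1):.2f}")
```

A low hit rate on a small, hand-written test set points at chunking or embeddings before any LLM call is involved.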

## Section 5: Complete RAG Implementation

In [None]:
# Complete RAG implementation using direct API calls (replaces LCEL chain)

# Initialize OpenAI chat client
chat_client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

def format_docs(docs):
    """Format retrieved documents into a single string."""
    return "\n\n".join(doc["page_content"] for doc in docs)

def rag_query(question, index, openai_client, chat_client, k=2):
    """
    Complete RAG pipeline: query embedding → Pinecone search → OpenAI chat.
    
    Args:
        question: User's question
        index: Pinecone index instance
        openai_client: OpenAI client for embeddings
        chat_client: OpenAI client for chat completions
        k: Number of documents to retrieve
    
    Returns:
        Answer string from the LLM
    """
    # Step 1: Embed the query
    query_embedding = embed_query(question, openai_client)
    
    # Step 2: Search Pinecone for relevant documents
    search_results = index.query(
        vector=query_embedding,
        top_k=k,
        include_metadata=True
    )
    
    # Step 3: Format retrieved context
    context_docs = []
    for match in search_results.matches:
        context_docs.append({
            "page_content": match.metadata["text"]
        })
    context = format_docs(context_docs)
    
    # Step 4: Create RAG prompt
    rag_prompt = f"""Answer the question based only on the following context:

{context}

Question: {question}

Answer:"""
    
    # Step 5: Call OpenAI chat completions
    response = chat_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "user", "content": rag_prompt}
        ],
        temperature=0
    )
    
    # Step 6: Extract and return the answer
    return response.choices[0].message.content

# Ask a question
question = "What is machine learning?"
response = rag_query(question, index, openai_client, chat_client, k=2)
print(f"Question: {question}")
print(f"\nAnswer: {response}")

**Scoping Insight**: Direct API calls make each step of the RAG pipeline explicit and transparent: embedding → retrieval → formatting → prompting → LLM → parsing. This approach gives you full control but requires more code than using frameworks like LangChain. Use direct APIs when you need fine-grained control, want to understand the underlying mechanics, or prefer minimal dependencies.
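
One benefit of having each step explicit is that the formatting and prompting steps can be isolated into a pure function and checked without any API call. The sketch below is one possible variant, not the notebook's prompt: `build_rag_messages` is a hypothetical helper, and the system message wording, the numbered sources, and the fallback text for an empty retrieval are all assumptions.

```python
def build_rag_messages(question, docs):
    """Build chat messages from a question and retrieved documents.

    docs: list of dicts with 'page_content'
    """
    if docs:
        # Number the sources so the answer can refer back to them
        context = "\n\n".join(
            f"[{i}] {doc['page_content']}" for i, doc in enumerate(docs, 1)
        )
    else:
        context = "(no relevant context was retrieved)"
    system = (
        "Answer the question using only the provided context. "
        "If the context does not contain the answer, say you don't know."
    )
    user = f"Context:\n{context}\n\nQuestion: {question}"
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": user},
    ]


messages = build_rag_messages(
    "What is machine learning?",
    [{"page_content": "Machine learning is a subset of artificial intelligence."}],
)
for message in messages:
    print(f"{message['role']}: {message['content']}\n")
```

The returned list has the shape expected by the `messages` argument of a chat completion call, so it can replace the inline prompt string if this structure suits your use case.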

## Section 6: Comparison: With vs Without RAG

In [None]:
# Without RAG: Direct API call
simple_response = chat_client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "user", "content": "What is machine learning?"}
    ],
    temperature=0
)

print("Without RAG (direct API call):")
print(simple_response.choices[0].message.content)

In [None]:
# With RAG: Context from vector database
rag_response = rag_query("What is machine learning?", index, openai_client, chat_client, k=2)
print("With RAG (retrieved context):")
print(rag_response)

**Scoping Insight**: Compare the complexity and cost of both approaches. RAG is powerful but requires infrastructure, embeddings, and retrieval logic. Simple API calls work for many use cases. Recognize when the added complexity of RAG is justified by the requirements.