# LangChain + OpenAI RAG Proof of Concept
## Overview
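* Collates the provided sources (web pages, PDFs, Google Drive documents, and carbon-intensity data) into a local vector store
* Persists the store to disk so it can be reloaded in later sessions
* Retrieves the chunks relevant to a question and builds the prompt sent to OpenAI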
### Dependencies
* User account for OpenAI and LangSmith
* API keys from OpenAI and LangSmith
* Main Python dependencies:
  * LangChain: `langchain-openai`, `langchain-core`, `langchain-text-splitters`, `langchain-community`, `langgraph`, `langchain[openai]`
  * GDrive: `google-api-python-client`, `google-auth-httplib2`, `google-auth-oauthlib`, `langchain-google-community`
  * Web: `bs4`, `unstructured`, `selenium`
* Add the following to `~/.bash_profile` (bash) or to `~/.zprofile` or `~/.zshrc` (zsh). Alternatively, define these variables in a `.env` file in the project root, make sure that file is listed in `.gitignore`, and see `config.example.env` for an example.
```
export LANGSMITH_TRACING=true
export LANGSMITH_ENDPOINT="https://{langsmith_region}.api.smith.langchain.com"
export LANGSMITH_API_KEY="{your_langsmith_api_key}"
export OPENAI_API_KEY="{your_openai_api_key}"
```
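
Before running the notebook, it can help to confirm that these variables are actually visible to Python. The following is a minimal sketch, not part of the project: the helper name `missing_env_vars` is hypothetical, and the list of required names is taken from the export lines above.

```python
import os

# Variable names copied from the export lines above.
REQUIRED_VARS = (
    "LANGSMITH_TRACING",
    "LANGSMITH_ENDPOINT",
    "LANGSMITH_API_KEY",
    "OPENAI_API_KEY",
)


def missing_env_vars(environ=None, required=REQUIRED_VARS):
    """Return the required variable names that are unset or empty.

    Hypothetical helper for illustration; `environ` defaults to os.environ.
    """
    env = os.environ if environ is None else environ
    return [name for name in required if not env.get(name)]


# Report whatever is missing from the current process environment.
missing = missing_env_vars()
if missing:
    print("Missing environment variables: " + ", ".join(missing))
else:
    print("All required environment variables are set.")
```

If anything is reported as missing, the notebook cells below prompt for the two API keys interactively, so this check is a convenience rather than a requirement.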

In [1]:
from dotenv import load_dotenv
   
# Try to load environment variables from .env file
load_dotenv()

True

In [2]:
import getpass
import sys
import os
from pathlib import Path

# Set up environment variables if not already set
if not os.environ.get("OPENAI_API_KEY"):
    os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter API key for OpenAI: ")

if not os.environ.get("LANGSMITH_API_KEY"):
    os.environ["LANGSMITH_API_KEY"] = getpass.getpass("Enter API key for LangSmith: ")

os.environ["USER_AGENT"] = "rag-ingestion-pipeline/1.0"

# Add the project root to the Python path
project_root = Path().absolute().parent
if str(project_root) not in sys.path:
    sys.path.append(str(project_root))

In [3]:
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from src.document_loaders import (
    load_web_document,
    load_pdf_document,
    load_google_drive_document,
    load_dynamic_web_document,
    load_carbon_intensity
)
from src.vector_store import VectorStoreManager
from src.rag_pipeline import RAGPipeline

# Load all documents first
project_id = "my_first_project"
all_docs = load_dynamic_web_document(
    ["https://www.greatyellow.earth", "https://www.greatyellow.earth/about"],
    project_id=project_id
) + load_pdf_document(
    [
        "/Users/jsouthin/Documents/Joe Southin - CV 2025 (A4).pdf",
        "/Users/jsouthin/Downloads/joe southin cv 2016.pdf"
    ],
    project_id=project_id
) + load_google_drive_document(
    ["1xCW-ZiquUxwBLpMTn9kL6WrvPlfeBYpDuicQhiIR81w"],
    "/Users/jsouthin/Downloads/lively-tensor-432422-c0-407d22e805d2.json",
    project_id=project_id
) + load_carbon_intensity(project_id=project_id)

print(f"Loaded {len(all_docs)} documents")



Loaded 8 documents


In [4]:
# Initialize models
llm = ChatOpenAI()
embedding_model = OpenAIEmbeddings(model="text-embedding-3-large")

# Initialize vector store manager
vector_store = VectorStoreManager(embedding_model)

# Try to load existing store or create new one
vector_store_path = f"{project_root}/data/faiss_index"
try:
    vector_store.load_store(
        vector_store_path
    )
    print(f"Loaded existing vector store with {vector_store.document_count} documents")
    remaining_docs = all_docs  # All docs need to be added if we loaded an existing store
except FileNotFoundError:
    print("No existing vector store found. Will create new one.")
    # Initialize with first document
    vector_store.init_store([all_docs[0]])
    # Remove the first document since it's already added
    remaining_docs = all_docs[1:]
    print("Initialized new vector store with first document")

# Initialize RAG pipeline
rag = RAGPipeline(vector_store, llm)

# Add remaining documents to pipeline
if remaining_docs:
    rag.add_documents(remaining_docs)

# Save updated vector store
vector_store.save_store(vector_store_path, force=True)

No existing vector store found. Will create new one.
Initialized new vector store with first document
Vector store updated with 104 new documents.
Vector store updated with 1 new documents.
Vector store saved to '/Users/jsouthin/Projects/bumblebee-rag/data/faiss_index'.
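
The log above reports far more stored entries than the 8 loaded documents, which is consistent with documents being split into smaller chunks before embedding. The splitting code is not shown in this notebook, so the following is only a sketch of one common approach (fixed-size character chunks with overlap). The function name and the default sizes are assumptions, not the project's settings.

```python
def split_with_overlap(text, chunk_size=1000, overlap=200):
    """Split text into chunks of at most chunk_size characters.

    Each chunk starts (chunk_size - overlap) characters after the previous
    one, so neighbouring chunks share `overlap` characters of context.
    Hypothetical helper; sizes are illustrative defaults.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a chunk has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


print(split_with_overlap("abcdefghij", chunk_size=4, overlap=1))
```

The overlap keeps a sentence that straddles a chunk boundary retrievable from at least one chunk, at the cost of storing some text twice.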


In [5]:
# Test some queries
questions = [
    "explain the differences between a type 1, 2, and 3 model",
    "what is Great Yellow?",
    "What is joe southin's biggest accomplishment?"
]

for question in questions:
    print(f"Q: {question}")
    answer = rag.query(question, project_id)
    print(f"A: {answer}\n")

Q: explain the differences between a type 1, 2, and 3 model
A: A Type 1 model in the context refers to supporting the concept of "passthrough," where publisher simulations are used as-is with minor transformations. Type 2 and Type 3 models, on the other hand, involve OHE encoding of different columns like campaign_id and client_id for generating inference and simulations. Type 3 models are considered more granular and can perform better than Type 2 models, which serve as a fallback option.

Q: what is Great Yellow?
A: Great Yellow is a bumblebee species that symbolizes the beauty and fragility of nature, clinging to survival in the wildest parts of the UK due to habitat loss. Great Yellow Ltd draws inspiration from this bee to promote nature restoration and emphasize the economic value of thriving natural ecosystems.

Q: What is joe southin's biggest accomplishment?
A: Joe Southin's biggest accomplishment includes developing an ML-powered budget management solution that optimized spend
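
The answers above depend on two steps named in the overview: finding the chunks most relevant to the question and building a prompt around them. The internals of `RAGPipeline.query` are not shown here, so this is a toy sketch of the idea only. It swaps the OpenAI embedding model for a simple word-count vector so it runs offline, and every function name in it is hypothetical.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding: lowercase word counts (a stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question, chunks, k=2):
    """Return the k chunks most similar to the question."""
    query_vector = embed(question)
    ranked = sorted(chunks, key=lambda c: cosine(query_vector, embed(c)), reverse=True)
    return ranked[:k]


def build_prompt(question, context_chunks):
    """Assemble a prompt that asks the model to answer from the context only."""
    context = "\n\n".join(context_chunks)
    return (
        "Answer the question using only the context below. "
        "If the context is insufficient, say you don't know.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )


chunks = [
    "Great Yellow is a bumblebee species found in the UK.",
    "Carbon intensity is measured in gCO2 per kWh.",
    "Type 3 models are more granular than Type 2 models.",
]
top = retrieve("what is Great Yellow?", chunks, k=1)
print(build_prompt("what is Great Yellow?", top))
```

A real pipeline would replace `embed` with the embedding model and the sorted scan with a vector index lookup; the prompt-building step stays essentially the same.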

In [6]:
# Repeat similar queries under a different project ID so that the indexed context is no longer available
questions = [
    "explain the differences between a type 1, 2, and 3 model",
    "why the name Great Yellow?",
    "What is joe southin's biggest accomplishment?"
]

for question in questions:
    print(f"Q: {question}")
    answer = rag.query(question, "some_other_project")
    print(f"A: {answer}\n")

Q: explain the differences between a type 1, 2, and 3 model
A: A type 1 model is a basic model that includes only one variable, while a type 2 model incorporates multiple variables but does not account for interactions between them. In contrast, a type 3 model considers interactions between variables, making it more complex and capable of capturing nonlinear relationships. Each type of model offers different levels of complexity and predictive power based on the relationships it can represent.

Q: why the name Great Yellow?
A: I don't know the reason for the name "Great Yellow."

Q: What is joe southin's biggest accomplishment?
A: Joe Southin's biggest accomplishment is becoming the first person to successfully climb Mount Everest without the use of supplemental oxygen in 1988.
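
With no matching context, two of the three answers above are invented by the model rather than declined. One way to reason about this is to treat project scoping as a metadata filter and to skip the model call when the filter returns nothing. The sketch below assumes each chunk carries a `project_id` in its metadata; that structure, and both function names, are assumptions for illustration and may not match how `RAGPipeline` is implemented.

```python
def filter_by_project(chunks, project_id):
    """Keep only chunks whose metadata carries the given project_id."""
    return [c for c in chunks if c.get("metadata", {}).get("project_id") == project_id]


def answer(question, chunks, project_id, generate):
    """Call `generate` only when the project has context; otherwise decline.

    `generate` is any callable taking (question, list_of_texts), for
    example a wrapper around an LLM call.
    """
    scoped = filter_by_project(chunks, project_id)
    if not scoped:
        return "I don't know."
    return generate(question, [c["text"] for c in scoped])


sample_chunks = [
    {"text": "Great Yellow is a bumblebee species.", "metadata": {"project_id": "my_first_project"}},
]


def fake_generate(question, texts):
    """Stand-in for the LLM call so the sketch runs offline."""
    return f"Answer based on {len(texts)} chunk(s)."


print(answer("what is Great Yellow?", sample_chunks, "my_first_project", fake_generate))
print(answer("what is Great Yellow?", sample_chunks, "some_other_project", fake_generate))
```

Declining before the model is called is a design choice: it trades away general-knowledge answers in exchange for never presenting an ungrounded answer as if it came from the project's documents.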



In [7]:
print(rag.query('what is the Current Carbon Intensity for current half hour', project_id))

The current Carbon Intensity for the current half hour is as follows:
From: 2025-05-21T13:00Z
To: 2025-05-21T13:30Z
Forecast: 66
Actual: 75
Index: low.
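
The fields in this answer (from, to, forecast, actual, index) suggest that `load_carbon_intensity` turns a JSON response from a carbon-intensity API into a text document before indexing. The loader itself is not shown, so the following is a sketch under that assumption. The payload shape is assumed and the function name is hypothetical; the sample values are the ones printed above.

```python
def format_intensity(payload):
    """Render the first entry of a carbon-intensity payload as plain text.

    Assumes a payload shaped like {"data": [{"from", "to", "intensity": {...}}]}.
    """
    entry = payload["data"][0]
    intensity = entry["intensity"]
    return (
        f"From: {entry['from']}\n"
        f"To: {entry['to']}\n"
        f"Forecast: {intensity['forecast']}\n"
        f"Actual: {intensity['actual']}\n"
        f"Index: {intensity['index']}"
    )


# Sample payload built from the values in the output above (no network call).
sample_payload = {
    "data": [
        {
            "from": "2025-05-21T13:00Z",
            "to": "2025-05-21T13:30Z",
            "intensity": {"forecast": 66, "actual": 75, "index": "low"},
        }
    ]
}

print(format_intensity(sample_payload))
```

Because this data changes every half hour, a document built this way goes stale quickly, so it would need to be reloaded and re-indexed before each session that relies on it.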


In [8]:
# Finally, re-save the store to update the checksum before closing
vector_store.save_store(vector_store_path, force=True)

Vector store saved to '/Users/jsouthin/Projects/bumblebee-rag/data/faiss_index'.
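
The comment in the last cell mentions a checksum, but how `VectorStoreManager` computes it is not shown. As an illustration of the general idea only, a checksum over a saved index directory could be sketched as below. The function name, the choice of SHA-256, and the file name `index.faiss` in the demo are all assumptions.

```python
import hashlib
import tempfile
from pathlib import Path


def directory_checksum(path):
    """SHA-256 over the relative names and bytes of all files under path.

    Files are visited in sorted order so the result is deterministic.
    Hypothetical helper; not the project's implementation.
    """
    digest = hashlib.sha256()
    root = Path(path)
    for file in sorted(p for p in root.rglob("*") if p.is_file()):
        digest.update(file.relative_to(root).as_posix().encode("utf-8"))
        digest.update(file.read_bytes())
    return digest.hexdigest()


# Demo on a temporary directory: changing a file changes the checksum.
with tempfile.TemporaryDirectory() as tmp:
    index_file = Path(tmp) / "index.faiss"
    index_file.write_bytes(b"v1")
    before = directory_checksum(tmp)
    index_file.write_bytes(b"v2")
    after = directory_checksum(tmp)

print(before != after)
```

Storing such a value alongside the index lets a later load detect that the files were modified or corrupted since the last save.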
