# LangChain + OpenAI RAG Proof of Concept
## Overview
* Collates the provided sources (web pages, PDFs, Google Drive documents) into a local vector store
* Persists the store to disk so it can be reloaded in later sessions
* Retrieves the chunks most relevant to a question and builds the prompt sent to OpenAI
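
The last step above, turning retrieved chunks into a prompt, can be sketched roughly as follows. This is a minimal illustration rather than the notebook's `RAGPipeline` code: the `build_prompt` helper and the instruction wording are assumptions.

```python
def build_prompt(question, chunks):
    """Assemble a grounded prompt from retrieved chunks (hypothetical helper)."""
    # Number each chunk so individual passages can be told apart.
    context = "\n\n".join(
        f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1)
    )
    return (
        "Answer the question using only the context below. "
        "If the context is insufficient, say you don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


prompt = build_prompt(
    "why the name Great Yellow?",
    ["First retrieved chunk.", "Second retrieved chunk."],  # placeholder text
)
print(prompt)
```

Telling the model to rely only on the supplied context is one common way to keep answers tied to the indexed sources.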
## Dependencies
* User accounts for OpenAI and LangSmith
* API keys for OpenAI and LangSmith
* Main Python dependencies:
  * LangChain: `langchain-openai` `langchain-core` `langchain-text-splitters` `langchain-community` `langgraph` `langchain[openai]`
  * GDrive: `google-api-python-client` `google-auth-httplib2` `google-auth-oauthlib` `langchain-google-community`
  * Web: `bs4` `unstructured` `selenium`
* Add the following to `~/.bash_profile` (bash), or to `~/.zprofile` or `~/.zshrc` (zsh):
```
export LANGSMITH_TRACING=true
export LANGSMITH_ENDPOINT="https://{langsmith_region}.api.smith.langchain.com"
export LANGSMITH_API_KEY="{your_langsmith_api_key}"
export OPENAI_API_KEY="{your_openai_api_key}"
```
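
Before starting the notebook, it can help to confirm that the shell actually exported these variables. The check below is a small sketch: `missing_env_vars` is a hypothetical helper name, and the variable list simply mirrors the exports above.

```python
import os

REQUIRED_VARS = (
    "LANGSMITH_TRACING",
    "LANGSMITH_ENDPOINT",
    "LANGSMITH_API_KEY",
    "OPENAI_API_KEY",
)


def missing_env_vars(names=REQUIRED_VARS, environ=None):
    """Return the names that are unset or empty (hypothetical helper)."""
    env = os.environ if environ is None else environ
    return [name for name in names if not env.get(name)]


missing = missing_env_vars()
if missing:
    print("Not set:", ", ".join(missing))
else:
    print("All required variables are set.")
```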

In [1]:
import getpass
import os

# Set up environment variables if not already set
if not os.environ.get("OPENAI_API_KEY"):
    os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter API key for OpenAI: ")

if not os.environ.get("LANGSMITH_API_KEY"):
    os.environ["LANGSMITH_API_KEY"] = getpass.getpass("Enter API key for LangSmith: ")

os.environ["USER_AGENT"] = "rag-ingestion-pipeline/1.0"

In [2]:
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from document_loaders import (
    load_web_document,
    load_pdf_document,
    load_google_drive_document,
    load_dynamic_web_document
)
from vector_store import VectorStoreManager
from rag_pipeline import RAGPipeline

# Initialize models
llm = ChatOpenAI()
embedding_model = OpenAIEmbeddings(model="text-embedding-3-large")

# Initialize vector store manager
vector_store = VectorStoreManager(embedding_model)

# Try to load existing store or create new one
vector_store_path = "/Users/jsouthin/Projects/rag/faiss_index"
try:
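    # Loading a saved index unpickles data from disk, so only enable
    # allow_dangerous_deserialization for index files you created or trust.
    vector_store.load_store(
        vector_store_path,
        allow_dangerous_deserialization=True
    )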
    print(f"Loaded existing vector store with {vector_store.document_count} documents")
except FileNotFoundError:
    print("No existing vector store found. Will create new one.")

Vector store loaded from '/Users/jsouthin/Projects/rag/faiss_index'.
Loaded existing vector store with 673 documents
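
The `VectorStoreManager` source is not shown in this notebook. As a rough sketch of the load-or-create pattern used above, and of the `save_store(..., force=True)` call used later, the toy class below persists records as JSON. Everything in it (`ToyStore`, its fields, the meaning given to `force`) is an assumption for illustration; the real class appears to wrap a FAISS index rather than a JSON file.

```python
import json
import os
import tempfile


class ToyStore:
    """JSON-backed stand-in for a vector store manager (illustrative only)."""

    def __init__(self):
        self.records = []

    @property
    def document_count(self):
        return len(self.records)

    def load_store(self, path):
        # open() raises FileNotFoundError when nothing has been saved yet,
        # which lets the caller fall back to an empty store.
        with open(path, encoding="utf-8") as fh:
            self.records = json.load(fh)

    def save_store(self, path, force=False):
        # Assumed meaning of `force`: allow overwriting an existing file.
        if os.path.exists(path) and not force:
            raise FileExistsError(f"{path} exists; pass force=True to overwrite")
        with open(path, "w", encoding="utf-8") as fh:
            json.dump(self.records, fh)


path = os.path.join(tempfile.mkdtemp(), "store.json")
store = ToyStore()
try:
    store.load_store(path)
except FileNotFoundError:
    print("No existing store found. Will create new one.")

store.records.append({"text": "example chunk", "project_id": "my_first_project"})
store.save_store(path)              # first save: the file does not exist yet
store.save_store(path, force=True)  # later saves must opt in to overwriting

reloaded = ToyStore()
reloaded.load_store(path)
print(reloaded.document_count)
```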


In [3]:
# Load documents
project_id = "my_first_project"

all_docs = load_dynamic_web_document(
    ["https://www.greatyellow.earth", "https://www.greatyellow.earth/about"],
    project_id=project_id
) + load_pdf_document(
    [
        "/Users/jsouthin/Documents/Joe Southin - CV 2025 (A4).pdf",
        "/Users/jsouthin/Downloads/joe southin cv 2016.pdf"
    ],
    project_id=project_id
) + load_google_drive_document(
    ["1xCW-ZiquUxwBLpMTn9kL6WrvPlfeBYpDuicQhiIR81w"],
    "/Users/jsouthin/Downloads/lively-tensor-432422-c0-407d22e805d2.json",
    project_id=project_id
)
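
Each loader call above takes a `project_id`. One plausible way for a loader to use it is to stamp the ID into every document's metadata so that queries can later be scoped to one project. The sketch below uses a hypothetical `Doc` dataclass and `tag_with_project` helper rather than the actual `document_loaders` module, and its sample text and `cv.pdf` source are placeholders.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Minimal stand-in for a loaded document (hypothetical)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def tag_with_project(docs, project_id):
    """Return copies of docs with project_id added to their metadata."""
    return [
        Doc(d.page_content, {**d.metadata, "project_id": project_id})
        for d in docs
    ]


web_docs = [Doc("Home page text", {"source": "https://www.greatyellow.earth"})]
pdf_docs = [Doc("CV text", {"source": "cv.pdf"})]

# Lists from different loaders concatenate with +, as in the cell above.
all_docs_sketch = tag_with_project(web_docs + pdf_docs, "my_first_project")
for doc in all_docs_sketch:
    print(doc.metadata)
```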


In [4]:
# Initialize RAG pipeline
rag = RAGPipeline(vector_store, llm)

# Add documents to pipeline (this will handle chunking and vector store updates)
rag.add_documents(all_docs)

# Save updated vector store
vector_store.save_store(vector_store_path, force=True)

Vector store updated with 110 new documents.
Vector store saved to '/Users/jsouthin/Projects/rag/faiss_index'.
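
The comment in the cell above says `add_documents` handles chunking. The pipeline's actual splitter and settings are not shown, so the sketch below only illustrates the general idea with a fixed-size character window and overlap. `chunk_text`, `chunk_size`, and `overlap` are illustrative names and values.

```python
def chunk_text(text, chunk_size=200, overlap=50):
    """Split text into overlapping character windows (illustrative)."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


sample = "abcdefghij" * 5  # 50 characters
pieces = chunk_text(sample, chunk_size=20, overlap=5)
print(len(pieces), [len(p) for p in pieces])
```

The overlap means a sentence that straddles a boundary still appears whole in at least one chunk, at the cost of storing some text twice.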


In [5]:
# Test some queries
questions = [
    "explain the differences between a type 1, 2, and 3 model",
    "why the name Great Yellow?",
    "What is joe southin's biggest accomplishment?"
]

for question in questions:
    print(f"Q: {question}")
    answer = rag.query(question, project_id)
    print(f"A: {answer}\n")

Q: explain the differences between a type 1, 2, and 3 model
A: A type 3 model, also known as Campaign id OHE model, one hot encodes campaign id and client id columns as features. It generates inference for campaigns present in the training set from the previous week. Inference for campaigns not in the training set can be generated using a type 2 model, known as Customer id OHE model, which drops client_id and campaign_id columns during training.

Q: why the name Great Yellow?
A: The name "Great Yellow" is inspired by the poignant symbol of the urgent need to protect and nurture biodiversity represented by the Great Yellow butterfly.


Q: What is joe southin's biggest accomplishment?
A: Joe Southin's biggest accomplishment is developing an ML-powered budget management solution using MultiRegressor with python data pipelines, optimizing spend allocation, and increasing customer retention, now managing $78M+ in annual ad-spend.
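
`rag.query(question, project_id)` takes the project as its second argument. A plausible reading is that retrieval is restricted to chunks tagged with that project before ranking them by similarity. The sketch below shows that idea with toy word-count vectors standing in for OpenAI embeddings. `embed`, `cosine`, and `retrieve` are hypothetical helpers and the chunk data is invented; none of it is the notebook's implementation.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding: lowercase word counts (stand-in for a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question, chunks, project_id, k=2):
    """Return up to k chunk texts from the given project, best match first."""
    q = embed(question)
    scoped = [c for c in chunks if c["project_id"] == project_id]
    ranked = sorted(scoped, key=lambda c: cosine(q, embed(c["text"])), reverse=True)
    return [c["text"] for c in ranked[:k]]


chunks = [
    {"text": "notes on how the company name was chosen", "project_id": "project_a"},
    {"text": "quarterly budget figures", "project_id": "project_a"},
    {"text": "the company name in another workspace", "project_id": "project_b"},
]

print(retrieve("why the company name?", chunks, "project_a", k=1))
print(retrieve("why the company name?", chunks, "project_c"))
```

Filtering before ranking means a chunk from another project can never be returned, however similar it is to the question.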



In [6]:
# Repeat the same queries under a different project ID, so the indexed context is no longer available
questions = [
    "explain the differences between a type 1, 2, and 3 model",
    "why the name Great Yellow?",
    "What is joe southin's biggest accomplishment?"
]

for question in questions:
    print(f"Q: {question}")
    answer = rag.query(question, "some_other_project")
    print(f"A: {answer}\n")

Q: explain the differences between a type 1, 2, and 3 model
A: I don't know the specific differences between a type 1, 2, and 3 model.

Q: why the name Great Yellow?
A: The name Great Yellow refers to a butterfly species called the Great Yellow Mormon. The name likely comes from its large size and bright yellow coloration. The specific reason for the name may vary depending on cultural or regional factors.

Q: What is joe southin's biggest accomplishment?
A: I'm sorry, I don't have enough information to determine Joe Southin's biggest accomplishment.
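
In the run above, two answers decline, but the "Great Yellow" answer appears to come from the model's general knowledge even though no project context was available. If that is undesirable, one option is to skip the model call whenever retrieval returns nothing. The sketch below assumes a hypothetical `answer` function that receives the retriever and the model as plain callables; it is not part of `RAGPipeline`.

```python
FALLBACK = "I don't know."


def answer(question, project_id, retrieve, llm):
    """Answer from retrieved context, or decline without calling the model."""
    chunks = retrieve(question, project_id)
    if not chunks:
        # No context for this project: return a fixed reply instead of
        # letting the model answer from its own training data.
        return FALLBACK
    context = "\n\n".join(chunks)
    return llm(f"Context:\n{context}\n\nQuestion: {question}")


# Stub retriever and model so the sketch runs without network access.
def stub_retrieve(question, project_id):
    return ["stored chunk"] if project_id == "my_first_project" else []


calls = []


def stub_llm(prompt):
    calls.append(prompt)
    return "answer based on context"


print(answer("why the name Great Yellow?", "my_first_project", stub_retrieve, stub_llm))
print(answer("why the name Great Yellow?", "some_other_project", stub_retrieve, stub_llm))
```

The guard also avoids paying for a model call when there is nothing to ground the answer in.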

