# LangChain + OpenAI RAG Proof of Concept
## Overview
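* Collates the provided sources (web pages, PDFs, Google Drive documents) into a local FAISS vector store
* Persists the store to disk so it can be reloaded in a later session
* Retrieves the chunks most relevant to a question and builds the prompt sent to OpenAI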
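
The flow in the bullets above can be sketched without any external service. This is a toy illustration only: the word-count `embed` function stands in for a real embedding model, and `top_chunks` and `build_prompt` are hypothetical helper names, not part of this notebook or of LangChain.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (a stand-in for a real embedding model)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_chunks(question: str, chunks: list[str], k: int = 2) -> list[str]:
    """Rank chunks by similarity to the question and keep the best k."""
    q = embed(question)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]


def build_prompt(question: str, context: list[str]) -> str:
    """Assemble the text that would be sent to the chat model."""
    joined = "\n\n".join(context)
    return f"Question: {question}\nContext: {joined}\nAnswer:"


chunks = [
    "FAISS stores vectors on disk",
    "Bumblebees pollinate flowers",
    "OpenAI embeddings map text to vectors",
]
selected = top_chunks("how are vectors stored", chunks, k=1)
prompt_text = build_prompt("how are vectors stored", selected)
print(prompt_text)
```

The notebook itself replaces `embed` with OpenAI embeddings and the sorted list with a FAISS index, but the shape of the pipeline is the same.
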
### Dependencies
* User account for OpenAI and LangSmith
* API keys from OpenAI and LangSmith
* Main Python dependencies:
  * LangChain: langchain-openai langchain-core langchain-text-splitters langchain-community langgraph langchain[openai]
  * GDrive: google-api-python-client google-auth-httplib2 google-auth-oauthlib langchain-google-community
  * Web: bs4 unstructured selenium
* Add the following to `~/.bash_profile` (bash), or to `~/.zprofile` or `~/.zshrc` (zsh):
```
export LANGSMITH_TRACING=true
export LANGSMITH_ENDPOINT="https://{langsmith_region}.api.smith.langchain.com"
export LANGSMITH_API_KEY="{your_langsmith_api_key}"
export OPENAI_API_KEY="{your_openai_api_key}"
```
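
A quick way to confirm the shell profile was picked up is to check the variables from Python before running the notebook. This is a minimal sketch; `missing_env_vars` is a hypothetical helper, and the variable names are the four exported above.

```python
import os

# The four variables exported in the shell profile above.
REQUIRED_VARS = (
    "LANGSMITH_TRACING",
    "LANGSMITH_ENDPOINT",
    "LANGSMITH_API_KEY",
    "OPENAI_API_KEY",
)


def missing_env_vars(env=None, required=REQUIRED_VARS):
    """Return the required variable names that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


missing = missing_env_vars()
if missing:
    print("Missing environment variables: " + ", ".join(missing))
else:
    print("All required environment variables are set.")
```

If anything is reported missing, open a new terminal (or `source` the profile) and restart the Jupyter kernel so it inherits the updated environment.
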

In [1]:
import getpass
import os

if not os.environ.get("OPENAI_API_KEY"):
    os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter API key for OpenAI: ")

if not os.environ.get("LANGSMITH_API_KEY"):
    os.environ["LANGSMITH_API_KEY"] = getpass.getpass("Enter API key for LangSmith: ")

os.environ["USER_AGENT"] = "rag-ingestion-pipeline/1.0"


In [2]:
# Test basic setup is working
# Should see an appropriate response, and also updated metrics in LangSmith dashboard

from langchain_openai import ChatOpenAI

llm = ChatOpenAI()
llm.invoke("Hello, world! what's 3+3?")


AIMessage(content='Hello! 3 + 3 is 6.', additional_kwargs={'refusal': None}, response_metadata={'token_usage': {'completion_tokens': 11, 'prompt_tokens': 18, 'total_tokens': 29, 'completion_tokens_details': {'accepted_prediction_tokens': 0, 'audio_tokens': 0, 'reasoning_tokens': 0, 'rejected_prediction_tokens': 0}, 'prompt_tokens_details': {'audio_tokens': 0, 'cached_tokens': 0}}, 'model_name': 'gpt-3.5-turbo-0125', 'system_fingerprint': None, 'id': 'chatcmpl-BZQLQiUMDQDFgR3LoaEiSlwStHPQm', 'service_tier': 'default', 'finish_reason': 'stop', 'logprobs': None}, id='run--ae212afe-bc4e-4a9f-9c5e-81beec81bbb4-0', usage_metadata={'input_tokens': 18, 'output_tokens': 11, 'total_tokens': 29, 'input_token_details': {'audio': 0, 'cache_read': 0}, 'output_token_details': {'audio': 0, 'reasoning': 0}})

In [3]:
# Set up an embedding model to vectorise documents and queries for similarity search
# The same model must also be passed to FAISS when loading a vector store from disk
from langchain_openai import OpenAIEmbeddings

embedding_model = OpenAIEmbeddings(model="text-embedding-3-large")

sample_text = (
    "LangGraph is a library for building stateful, multi-actor applications with LLMs"
)
sample_embedded_vector = embedding_model.embed_documents([sample_text])
print(str(sample_embedded_vector)[:100])  # Show the first 100 characters of the vector


[[-0.01016364898532629, 0.02342759631574154, -0.04225384443998337, -0.0015080638695508242, -0.023511


In [4]:
# Utility functions to support loading, saving, updating and initialising vector stores
from pathlib import Path

from langchain_community.vectorstores import FAISS
from langchain_core.documents import Document

# In-memory store
_vector_store = None


def init_vector_store(docs, embeddings):
    global _vector_store
    _vector_store = FAISS.from_documents(docs, embeddings)


def save_vector_store(path: str, force: bool = False):
    if Path(path).exists() and not force:
        raise FileExistsError(f"Path '{path}' already exists. Use a different path, remove the existing store, or pass force=True.")
    if _vector_store is None:
        raise RuntimeError("No vector store in memory to save.")
    _vector_store.save_local(path)
    print(f"New vector store saved to '{path}'.")


def load_vector_store(path: str, embedding_model, force: bool = False, **kwargs):
    global _vector_store
    if _vector_store is not None and not force:
        raise RuntimeError("A vector store is already loaded in memory. Use force=True to override.")
    if not Path(path).exists():
        raise FileNotFoundError(f"Vector store path '{path}' does not exist.")
    # Loading a FAISS store unpickles a file, so require an explicit opt-in from the caller
    if "allow_dangerous_deserialization" not in kwargs:
        raise ValueError(
            "Deserialization relies on loading a pickle file. Pickle files can be modified to "
            "deliver a malicious payload that results in execution of arbitrary code on your machine. "
            "You will need to set `allow_dangerous_deserialization` to `True` to enable deserialization. "
            "If you do this, make sure that you trust the source of the data. For example, if you are "
            "loading a file that you created, and know that no one else has modified the file, then this "
            "is safe to do. Do not set this to `True` if you are loading a file from an untrusted source "
            "(e.g., some random site on the internet)."
        )
    _vector_store = FAISS.load_local(path, embedding_model, allow_dangerous_deserialization=kwargs["allow_dangerous_deserialization"])
    print(f"Vector store loaded from '{path}'.")


def update_vector_store(new_docs: list[Document], save_path: str, force: bool = False):
    global _vector_store
    if _vector_store is None:
        raise RuntimeError("No vector store in memory. Load or initialize it first.")        
    _vector_store.add_documents(new_docs)
    print(f"Vector store updated with {len(new_docs)} new documents.")
    save_vector_store(save_path, force)


def get_vector_store():
    if _vector_store is None:
        raise RuntimeError("No vector store in memory.")
    return _vector_store

vector_store_path = "/Users/jsouthin/Projects/rag/faiss_index"
load_vector_store(vector_store_path, embedding_model, allow_dangerous_deserialization=True)


Vector store loaded from '/Users/jsouthin/Projects/rag/faiss_index'.
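
Because loading the store unpickles a file, it can help to confirm the saved directory has not changed since it was written. The sketch below is one possible approach, not something the notebook or LangChain does: `directory_digest` and `verify_store` are hypothetical helpers, and it assumes you recorded the digest yourself right after saving.

```python
import hashlib
from pathlib import Path


def directory_digest(path: str) -> str:
    """Return one SHA-256 hex digest covering every file under `path` (relative names and bytes)."""
    digest = hashlib.sha256()
    for file in sorted(Path(path).rglob("*")):
        if file.is_file():
            digest.update(file.relative_to(path).as_posix().encode())
            digest.update(file.read_bytes())
    return digest.hexdigest()


def verify_store(path: str, expected_digest: str) -> None:
    """Raise ValueError if the store directory no longer matches the recorded digest."""
    actual = directory_digest(path)
    if actual != expected_digest:
        raise ValueError(f"Vector store at '{path}' has changed since its digest was recorded.")
```

A possible usage pattern: call `directory_digest(vector_store_path)` straight after `save_vector_store`, keep the result somewhere separate from the index, and call `verify_store` before `load_vector_store` with `allow_dangerous_deserialization=True`.
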


In [5]:
# Verify the store is working. Expect at least 1 searchable document.
def test_vector_store():
    if _vector_store is None:
        raise RuntimeError("No vector store loaded in memory.")

    try:
        # Try retrieving a dummy query
        print(f'Current vector store: {get_vector_store()}')
        # k=1 caps the result at one hit, so this only confirms the index answers queries
        results = _vector_store.similarity_search("test", k=1)
        if results:
            print(f"Vector store contains {len(results)} searchable documents.")
        else:
            print("Vector store loaded but no searchable results found.")
    except Exception as e:
        print(f"Error testing vector store: {e}")


test_vector_store()


Current vector store: <langchain_community.vectorstores.faiss.FAISS object at 0x12159b6d0>
Vector store contains 1 searchable documents.


In [6]:
print(f'Number of documents/chunks in vector store: {_vector_store.index.ntotal}')

Number of documents/chunks in vector store: 225


In [7]:
%load_ext autoreload
%autoreload 2


In [8]:
from document_loaders import (
    load_dynamic_web_document, load_google_drive_document,
    load_notion_document, load_pdf_document, load_web_document,
)

project_id = "my_first_project"

# project_id is passed to every loader; retrieval later filters on the project_id metadata
all_docs = (
    load_dynamic_web_document(["https://www.greatyellow.earth", "https://www.greatyellow.earth/about"], project_id=project_id)
    + load_pdf_document(["/Users/jsouthin/Documents/Joe Southin - CV 2025 (A4).pdf", "/Users/jsouthin/Downloads/joe southin cv 2016.pdf"], project_id=project_id)
    + load_google_drive_document(["1xCW-ZiquUxwBLpMTn9kL6WrvPlfeBYpDuicQhiIR81w"], "/Users/jsouthin/Downloads/lively-tensor-432422-c0-407d22e805d2.json", project_id=project_id)
    # + load_notion_document(token, [...])
)


In [9]:
import bs4
from langchain import hub
from langchain_community.document_loaders import WebBaseLoader
from langchain_core.documents import Document
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langgraph.graph import START, StateGraph
from typing_extensions import List, TypedDict

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
all_splits = text_splitter.split_documents(all_docs)

# Index chunks
_ = _vector_store.add_documents(documents=all_splits)

# Define prompt for question-answering
# N.B. for non-US LangSmith endpoints, you may need to specify
# api_url="https://api.smith.langchain.com" in hub.pull.
#prompt = hub.pull("rlm/rag-prompt")

from langchain_core.prompts import PromptTemplate

prompt = PromptTemplate.from_template(
'''
You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question. If you don't know the answer, just say that you don't know. Use three sentences maximum and keep the answer concise.
Question: {question}
Context: {context}
Answer:
'''
)



# Define state for application
class State(TypedDict):
    question: str
    project_id: str
    context: List[Document]
    answer: str


# Define application steps
def retrieve(state: State):
    # similarity_search returns the top matches (k=4 by default) across the whole store;
    # the project filter is applied afterwards, so fewer than k chunks (or none) may remain
    retrieved_docs = _vector_store.similarity_search(state["question"])
    filtered_docs = [doc for doc in retrieved_docs if doc.metadata.get("project_id") == state["project_id"]]
    return {"context": filtered_docs}


def generate(state: State):
    # Join the retrieved chunks into one context string and fill the prompt template
    docs_content = "\n\n".join(doc.page_content for doc in state["context"])
    messages = prompt.invoke({"question": state["question"], "context": docs_content})
    response = llm.invoke(messages)
    return {"answer": response.content}


# Compile application and test
graph_builder = StateGraph(State).add_sequence([retrieve, generate])
graph_builder.add_edge(START, "retrieve")
graph = graph_builder.compile()
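
The `chunk_size=1000, chunk_overlap=200` settings above mean consecutive chunks share up to 200 characters, so a sentence that straddles a boundary still appears whole in at least one chunk. The sketch below illustrates the overlap idea with a plain fixed-size sliding window. It is a simplification: `RecursiveCharacterTextSplitter` additionally tries to break on separators such as paragraphs and sentences, which this hypothetical `sliding_window_chunks` helper does not attempt.

```python
def sliding_window_chunks(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> list[str]:
    """Split text into fixed-size chunks where each chunk repeats the tail of the previous one."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window has reached the end of the text
    return chunks


demo = sliding_window_chunks("abcdefghij", chunk_size=4, chunk_overlap=2)
print(demo)  # ['abcd', 'cdef', 'efgh', 'ghij']
```

With the notebook's settings, a 2,500-character document would yield three windows under this simplified scheme, starting at offsets 0, 800 and 1,600.
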


In [10]:
save_vector_store(vector_store_path,force=True)

New vector store saved to '/Users/jsouthin/Projects/rag/faiss_index'.


In [11]:
test_vector_store()

Current vector store: <langchain_community.vectorstores.faiss.FAISS object at 0x12159b6d0>
Vector store contains 1 searchable documents.


In [12]:
response = graph.invoke({"question": "explain the differences between a type 1, 2, and 3 model","project_id":project_id})
print(response["answer"])

A type 2 model is known as the Customer id OHE model, while a type 3 model is referred to as the Campaign id OHE model. The main difference lies in the features used during training - for the type 2 model, the pca_id is one hot encoded, while for the type 3 model, both campaign id and client id columns are one hot encoded. In addition, the type 3 model can only generate inference or simulations for campaigns present in the previous week's training set, while type 2 can handle those not present in the training set.


In [13]:
response = graph.invoke({"question": "why the name Great Yellow?","project_id":project_id})
print(response["answer"])

The name Great Yellow is inspired by the Great Yellow bumblebee, a species that symbolizes the beauty and fragility of nature. It highlights the importance of protecting biodiversity and nurturing natural systems. Great Yellow aims to demonstrate that thriving ecosystems are economically valuable.


In [14]:
response = graph.invoke({"question": "What is joe southin's biggest accomplishment?","project_id":project_id})
print(response["answer"])

Joe Southin's biggest accomplishment is developing an ML-powered budget management solution that optimized spend allocation, increased customer retention, and now manages $78M+ in annual ad-spend.
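
One caveat on the `retrieve` step defined earlier: it runs the similarity search first and filters on `project_id` afterwards. Once the store holds more than one project, the top hits can all belong to another project and be discarded, leaving little or no context. The toy comparison below illustrates the difference; the `Chunk` class, the scores and both helper functions are hypothetical and only stand in for the real index.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    project_id: str
    score: float  # hypothetical precomputed similarity to the question


def filter_after_search(chunks: list[Chunk], project_id: str, k: int) -> list[Chunk]:
    """Take the global top k, then drop chunks from other projects."""
    top = sorted(chunks, key=lambda c: c.score, reverse=True)[:k]
    return [c for c in top if c.project_id == project_id]


def filter_before_search(chunks: list[Chunk], project_id: str, k: int) -> list[Chunk]:
    """Restrict to the project first, then take its top k."""
    own = [c for c in chunks if c.project_id == project_id]
    return sorted(own, key=lambda c: c.score, reverse=True)[:k]


store = [
    Chunk("other project, strong match", "project_b", 0.95),
    Chunk("other project, good match", "project_b", 0.90),
    Chunk("my project, fair match", "project_a", 0.60),
]
after = filter_after_search(store, "project_a", k=2)
before = filter_before_search(store, "project_a", k=2)
print(len(after), len(before))  # 0 1
```

LangChain's FAISS wrapper appears to accept a `filter` argument on `similarity_search` that serves this purpose; it is worth checking the signature in your installed version before relying on it.
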
