# LangChain Semantic Search / Knowledge-Base Demo 🦜🔗

This notebook shows how to build a **searchable knowledge base** and perform **retrieval-augmented generation (RAG)** using LangChain, OpenAI embeddings, and a local in-memory vector store.  

## Takeaways
The whole LangChain pipeline is modular. LangChain’s design makes it easy to swap:  
- vector stores (in-memory, Pinecone, ...)  
- embedding models (OpenAI, Azure, Gemini, Hugging Face, ...)  
- document loaders  
- LLMs for RAG  

This flexibility is one of the biggest strengths of the framework.

### Compliance options for generating embeddings
- OpenAI API (default): Secure, encrypted, but hosted on OpenAI’s servers. Good for general use.
- Azure OpenAI Service: Same models, but hosted in Microsoft Azure with enterprise-grade compliance (GDPR, HIPAA, FedRAMP, etc.).
- Local embeddings (Hugging Face, etc.): Full control, no data leaves your environment, but requires more compute resources.
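
The three options above can be selected behind a single function. The following is a minimal sketch rather than part of the original notebook: `make_embeddings` is a hypothetical helper, the Azure environment variable names and the Hugging Face model name are assumptions, and the `azure` and `huggingface` branches assume that `langchain-openai` and `langchain-huggingface` are installed.

```python
import os


def make_embeddings(provider: str = "openai"):
    """Return a LangChain embeddings object for the chosen provider.

    Imports are deferred so that only the selected provider's package
    has to be installed.
    """
    if provider == "openai":
        from langchain_openai import OpenAIEmbeddings

        return OpenAIEmbeddings(model="text-embedding-3-small")
    if provider == "azure":
        from langchain_openai import AzureOpenAIEmbeddings

        # Assumed environment variable names; adjust to your deployment.
        return AzureOpenAIEmbeddings(
            azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
            azure_deployment=os.environ["AZURE_OPENAI_DEPLOYMENT_NAME"],
            openai_api_version=os.environ["AZURE_OPENAI_API_VERSION"],
        )
    if provider == "huggingface":
        from langchain_huggingface import HuggingFaceEmbeddings

        # Assumed model choice; inference runs locally once the weights
        # have been downloaded.
        return HuggingFaceEmbeddings(
            model_name="sentence-transformers/all-mpnet-base-v2"
        )
    raise ValueError(f"Unknown embedding provider: {provider!r}")
```

Because every branch returns an object with the same `embed_query` / `embed_documents` interface, the rest of the pipeline should not need to change when the provider does.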

Based on the LangChain “Build a semantic search engine” tutorial:  
https://docs.langchain.com/oss/python/langchain/knowledge-base#openai  

## What this notebook does  
- Load documents (e.g. PDF, text, markdown) using LangChain’s document-loader abstractions.  
- Optionally split large documents into smaller chunks for better retrieval granularity.  
- Embed the text chunks into numeric vectors via OpenAI embeddings.  
- Store embeddings + metadata in a vector store for efficient similarity search.  
- Query the vector store: given a user query, retrieve relevant document snippets.
- Create a RAG agent that calls the retriever as a tool to answer questions.

In [None]:
# !pip install langchain-community pypdf langchain-core langchain


In [2]:
# Load environment variables (LangSmith tracing settings and OPENAI_API_KEY) from a .env file
from dotenv import load_dotenv
import os 

load_dotenv()

from langsmith import traceable
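
Since `load_dotenv()` does not complain when a variable is absent, a quick check can catch a missing key before any API call fails. This is a minimal sketch: `missing_env_vars` is a hypothetical helper, and the two `LANGSMITH_*` names are assumptions about what the `.env` file contains (only `OPENAI_API_KEY` appears in the notebook).

```python
import os


def missing_env_vars(required, env=None):
    """Return the names in `required` that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


# OPENAI_API_KEY comes from the notebook; the LangSmith names are assumed.
REQUIRED = ["OPENAI_API_KEY", "LANGSMITH_API_KEY", "LANGSMITH_TRACING"]

missing = missing_env_vars(REQUIRED)
if missing:
    print("Missing environment variables:", ", ".join(missing))
```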

### Document Loading

In [3]:
from langchain_community.document_loaders import PyPDFLoader

filepath = "./datasets/nke-10k-2023.pdf"
loader = PyPDFLoader(filepath)

# LangChain's PyPDFLoader loads each page as a separate Document
docs = loader.load()
print(len(docs))

107


In [4]:
print(f"{docs[10].page_content[:100]}\n")
print(docs[10].metadata)

Table of Contents
INFORMATION ABOUT OUR EXECUTIVE OFFICERS
The executive officers of NIKE, Inc. as o

{'producer': 'EDGRpdf Service w/ EO.Pdf 22.0.40.0', 'creator': 'EDGAR Filing HTML Converter', 'creationdate': '2023-07-20T16:22:00-04:00', 'title': '0000320187-23-000039', 'author': 'EDGAR Online, a division of Donnelley Financial Solutions', 'subject': 'Form 10-K filed on 2023-07-20 for the period ending 2023-05-31', 'keywords': '0000320187-23-000039; ; 10-K', 'moddate': '2023-07-20T16:22:08-04:00', 'source': './datasets/nke-10k-2023.pdf', 'total_pages': 107, 'page': 10, 'page_label': '11'}


In [5]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1200, chunk_overlap=200, add_start_index=True
)
all_splits = text_splitter.split_documents(docs)

print(len(all_splits))

415
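
To see what `chunk_size`, `chunk_overlap`, and `add_start_index` mean, here is a simplified sliding-window splitter. It is only a sketch: `split_with_overlap` is a hypothetical function, and it deliberately ignores the separator-aware logic that `RecursiveCharacterTextSplitter` uses to avoid cutting in the middle of paragraphs and sentences.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int):
    """Split `text` into fixed-size windows that overlap by `chunk_overlap`.

    Returns (start_index, chunk) pairs. Simplified stand-in, not the
    algorithm used by RecursiveCharacterTextSplitter.
    """
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and < chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append((start, text[start : start + chunk_size]))
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
for start, chunk in split_with_overlap(sample, chunk_size=10, chunk_overlap=3):
    print(start, chunk)
# 0 abcdefghij
# 7 hijklmnopq
# 14 opqrstuvwx
# 21 vwxyz
```

Each chunk repeats the last three characters of the previous one, which is the role the 200-character overlap plays in the notebook: text that straddles a chunk boundary still appears whole in at least one chunk.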


## Creating embeddings
Note: Embeddings are generated on OpenAI’s servers, not locally.

In [6]:
# !pip install -U "langchain-openai"
import getpass
import os

if not os.environ.get("OPENAI_API_KEY"):
    os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter API key for OpenAI: ")

from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

In [7]:
vector_1 = embeddings.embed_query(all_splits[0].page_content)
vector_2 = embeddings.embed_query(all_splits[1].page_content)

assert len(vector_1) == len(vector_2)
print(f"Generated vectors of length {len(vector_1)}\n")
print(vector_1[:10])

Generated vectors of length 1536

[0.04834814369678497, 0.0482519268989563, 0.015562810003757477, -0.0005802979576401412, 0.026940258219838142, -0.002202426316216588, -0.009290780872106552, 0.021564234048128128, 0.003556956071406603, 0.01862967014312744]
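
Vectors like the one above are usually compared by cosine similarity. The sketch below uses only the standard library and toy 3-dimensional vectors; `cosine_similarity` is a hypothetical helper written for illustration, not a LangChain function.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy vectors standing in for the 1536-dimensional embeddings.
print(cosine_similarity([1.0, 0.0, 0.0], [1.0, 0.0, 0.0]))  # same direction
print(cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]))  # orthogonal
```

In the notebook, `cosine_similarity(vector_1, vector_2)` would give a rough measure of how related the first two chunks are.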


### Store embeddings in in-memory vector store

In [8]:
from langchain_core.vectorstores import InMemoryVectorStore

vector_store = InMemoryVectorStore(embeddings)
ids = vector_store.add_documents(documents=all_splits)
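
Conceptually, an in-memory vector store is a list of (vector, text) rows searched by brute force. The sketch below illustrates that idea, assuming cosine similarity as the metric. `ToyVectorStore` and `toy_embed` are hypothetical names, and the letter-count "embedding" only stands in for a real model; this is not how `InMemoryVectorStore` is implemented.

```python
import math
from typing import Callable, List, Tuple


def toy_embed(text: str) -> List[float]:
    """Hypothetical stand-in for an embedding model: letter counts a-z."""
    counts = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            counts[ord(ch) - ord("a")] += 1.0
    return counts


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity, returning 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms if norms else 0.0


class ToyVectorStore:
    """Stores (vector, text) rows in a list and searches them by brute force."""

    def __init__(self, embed: Callable[[str], List[float]]):
        self.embed = embed
        self.rows: List[Tuple[List[float], str]] = []

    def add_texts(self, texts: List[str]) -> None:
        for text in texts:
            self.rows.append((self.embed(text), text))

    def similarity_search_with_score(self, query: str, k: int = 4):
        """Return the k most similar texts with their scores, best first."""
        query_vec = self.embed(query)
        scored = [(text, cosine(query_vec, vec)) for vec, text in self.rows]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:k]


store = ToyVectorStore(toy_embed)
store.add_texts(["aaa bbb", "xyz xyz", "aab"])
print(store.similarity_search_with_score("aaab", k=2))
```

A brute-force scan like this is fine for a few hundred chunks, which is why an in-memory store is enough for a single 10-K filing.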

In [9]:
results = vector_store.similarity_search(
    "How many distribution centers does Nike have in the US?"
)

print(len(results))
print(results[0])

4
page_content='wholesale, NIKE Direct and merchandising strategies in the region, among other functions.
In the United States, NIKE has eight significant distribution centers. Five are located in or near Memphis, Tennessee, two of which are owned and three of which are
leased. Two other distribution centers, one located in Indianapolis, Indiana and one located in Dayton, Tennessee, are leased and operated by third-party logistics
providers. One distribution center for Converse is located in Ontario, California, which is leased. NIKE has a number of distribution facilities outside the United States,
some of which are leased and operated by third-party logistics providers. The most significant distribution facilities outside the United States are located in Laakdal,
Belgium; Taicang, China; Tomisato, Japan and Icheon, Korea, all of which we own.
Air Manufacturing Innovation manufactures cushioning components used in footwear at NIKE-owned and leased facilities located near Beaverton, Or

In [10]:
vector_store.similarity_search_with_score("What was Nike's revenue in 2023?")

[(Document(id='56bb7546-5927-4a7a-a67f-3e9f295b1658', metadata={'producer': 'EDGRpdf Service w/ EO.Pdf 22.0.40.0', 'creator': 'EDGAR Filing HTML Converter', 'creationdate': '2023-07-20T16:22:00-04:00', 'title': '0000320187-23-000039', 'author': 'EDGAR Online, a division of Donnelley Financial Solutions', 'subject': 'Form 10-K filed on 2023-07-20 for the period ending 2023-05-31', 'keywords': '0000320187-23-000039; ; 10-K', 'moddate': '2023-07-20T16:22:08-04:00', 'source': './datasets/nke-10k-2023.pdf', 'total_pages': 107, 'page': 30, 'page_label': '31', 'start_index': 1938}, page_content='FINANCIAL HIGHLIGHTS\n• In fiscal 2023, NIKE, Inc. achieved record Revenues of $51.2 billion, which increased 10% and 16% on a reported and currency-neutral basis, respectively\n• NIKE Direct revenues grew 14% from $18.7 billion in fiscal 2022 to $21.3 billion in fiscal 2023, and represented approximately 44% of total NIKE Brand revenues for\nfiscal 2023\n• Gross margin for the fiscal year decre
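
The `(document, score)` pairs returned by `similarity_search_with_score` can be used to drop weak matches before they reach the LLM. The sketch below is hedged accordingly: `filter_by_score` is a hypothetical helper, the scores are made up for illustration, and it assumes that a higher score means a closer match. Some vector stores return distances instead, where lower is better, so check the store's documentation before applying a threshold.

```python
def filter_by_score(results, min_score):
    """Keep (item, score) pairs whose score is at least `min_score`.

    Assumes a higher score means a closer match.
    """
    return [(item, score) for item, score in results if score >= min_score]


# Hypothetical scores for illustration, not taken from the notebook's output.
hits = [("chunk A", 0.62), ("chunk B", 0.41), ("chunk C", 0.18)]
print(filter_by_score(hits, min_score=0.4))
# [('chunk A', 0.62), ('chunk B', 0.41)]
```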

In [11]:
# Return documents based on similarity to an embedded query
embedding = embeddings.embed_query("What was Nike's revenue in 2023?")

vector_store.similarity_search_by_vector(embedding)


[Document(id='56bb7546-5927-4a7a-a67f-3e9f295b1658', metadata={'producer': 'EDGRpdf Service w/ EO.Pdf 22.0.40.0', 'creator': 'EDGAR Filing HTML Converter', 'creationdate': '2023-07-20T16:22:00-04:00', 'title': '0000320187-23-000039', 'author': 'EDGAR Online, a division of Donnelley Financial Solutions', 'subject': 'Form 10-K filed on 2023-07-20 for the period ending 2023-05-31', 'keywords': '0000320187-23-000039; ; 10-K', 'moddate': '2023-07-20T16:22:08-04:00', 'source': './datasets/nke-10k-2023.pdf', 'total_pages': 107, 'page': 30, 'page_label': '31', 'start_index': 1938}, page_content='FINANCIAL HIGHLIGHTS\n• In fiscal 2023, NIKE, Inc. achieved record Revenues of $51.2 billion, which increased 10% and 16% on a reported and currency-neutral basis, respectively\n• NIKE Direct revenues grew 14% from $18.7 billion in fiscal 2022 to $21.3 billion in fiscal 2023, and represented approximately 44% of total NIKE Brand revenues for\nfiscal 2023\n• Gross margin for the fiscal year decrea

In [14]:
## RAG agent
from langchain.tools import tool

@tool(response_format="content_and_artifact")
def retrieve_context(query: str):
    """Retrieve information to help answer a query."""
    retrieved_docs = vector_store.similarity_search(query, k=2)
    serialized = "\n\n".join(
        (f"Source: {doc.metadata}\nContent: {doc.page_content}")
        for doc in retrieved_docs
    )
    return serialized, retrieved_docs
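
With `response_format="content_and_artifact"`, the tool returns a two-element tuple: a string that is passed to the model, and the raw documents, which are kept as an artifact on the tool message. The sketch below reproduces only the serialization step so the string format is visible without an API call. `FakeDoc` and `serialize` are hypothetical stand-ins, and the page numbers are illustrative.

```python
from dataclasses import dataclass, field


@dataclass
class FakeDoc:
    """Hypothetical stand-in with the two attributes the tool reads."""

    page_content: str
    metadata: dict = field(default_factory=dict)


def serialize(docs):
    """Same join logic as retrieve_context, applied to any doc-like objects."""
    return "\n\n".join(
        f"Source: {doc.metadata}\nContent: {doc.page_content}" for doc in docs
    )


docs = [
    FakeDoc("First chunk.", {"page": 30}),
    FakeDoc("Second chunk.", {"page": 35}),
]
# (text for the model, raw documents kept as the artifact)
content, artifact = serialize(docs), docs
print(content)
```

Including the metadata in the string lets the model cite the page a passage came from, at the cost of extra tokens per retrieved chunk.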

In [15]:
from langchain.agents import create_agent
from langchain.chat_models import init_chat_model

tools = [retrieve_context]
# If desired, specify custom instructions
prompt = (
    "You have access to a tool that retrieves context from a pdf. "
    "Use the tool to help answer user queries."
)
model = init_chat_model("gpt-4.1")
agent = create_agent(model, tools, system_prompt=prompt)

In [16]:
query = (
    "What was Nike's revenue in 2023?\n\n"
    "Once you get the answer, look up why the margins were impacted if any."
)

for event in agent.stream(
    {"messages": [{"role": "user", "content": query}]},
    stream_mode="values",
):
    event["messages"][-1].pretty_print()


What was Nike's revenue in 2023?

Once you get the answer, look up why the margins were impacted if any.
Tool Calls:
  retrieve_context (call_vgnJdRdrrJrac8Omt2JjrLDG)
 Call ID: call_vgnJdRdrrJrac8Omt2JjrLDG
  Args:
    query: Nike revenue 2023
Name: retrieve_context

Source: {'producer': 'EDGRpdf Service w/ EO.Pdf 22.0.40.0', 'creator': 'EDGAR Filing HTML Converter', 'creationdate': '2023-07-20T16:22:00-04:00', 'title': '0000320187-23-000039', 'author': 'EDGAR Online, a division of Donnelley Financial Solutions', 'subject': 'Form 10-K filed on 2023-07-20 for the period ending 2023-05-31', 'keywords': '0000320187-23-000039; ; 10-K', 'moddate': '2023-07-20T16:22:08-04:00', 'source': './datasets/nke-10k-2023.pdf', 'total_pages': 107, 'page': 35, 'page_label': '36', 'start_index': 0}
Content: Table of Contents
FISCAL 2023 NIKE BRAND REVENUE HIGHLIGHTSThe following tables present NIKE Brand revenues disaggregated by reportable operating segment, distribution channel and major product line