### Simple Gen AI App | Retrieval-Augmented Generation (RAG) pipeline using LangChain

In [None]:
import os
from dotenv import load_dotenv
# Loads environment variables from a .env file into your system's environment.
load_dotenv()
# os.environ is a dictionary in Python that holds all the environment variables available to your system during the program's execution.

# Set your OpenAI API key from environment variables. This key is used by LangChain to make API calls to OpenAI models (like GPT-4).
os.environ['OPENAI_API_KEY'] = os.getenv('OPENAI_API_KEY')

# This is needed to enable LangSmith logging/tracing for chains, prompts, and LLM calls.
os.environ['LANGCHAIN_API_KEY'] = os.getenv('LANGCHAIN_API_KEY')

# Setting this to 'true' will start recording all your chain/agent/LLM activity for debugging and evaluation.
os.environ['LANGCHAIN_TRACING_V2'] = 'true'

# This helps organize and filter traces inside the LangSmith dashboard based on this project.
os.environ['Langchain_Project'] = os.getenv('LANGCHAIN_PROJECT')

In [2]:
##### Data Ingestion - Load Data #####
from langchain_community.document_loaders import WebBaseLoader

USER_AGENT environment variable not set, consider setting it to identify your requests.


In [3]:
loader = WebBaseLoader("https://docs.smith.langchain.com/administration/tutorials/manage_spend")
docs = loader.load()
docs[0].page_content[:1000]

'\n\n\n\n\nOptimize tracing spend on LangSmith | 🦜️🛠️ LangSmith\n\n\n\n\n\n\nSkip to main contentWe are growing and hiring for multiple roles for LangChain, LangGraph and LangSmith. Join our team!API ReferenceRESTPythonJS/TSSearchRegionUSEUGo to AppGet StartedObservabilityEvaluationPrompt EngineeringDeployment (LangGraph Platform)AdministrationTutorialsOptimize tracing spend on LangSmithHow-to GuidesSetupConceptual GuideSelf-hostingPricingReferenceCloud architecture and scalabilityAuthz and AuthnAuthentication methodsdata_formatsEvaluationDataset transformationsRegions FAQsdk_referenceAdministrationTutorialsOptimize tracing spend on LangSmithOn this pageOptimize tracing spend on LangSmith\nRecommended ReadingBefore diving into this content, it might be helpful to read the following:\nData Retention Conceptual Docs\nUsage Limiting Conceptual Docs\n\nnoteSome of the features mentioned in this guide are not currently available in Enterprise plan due to its\ncustom nature of billing. If yo

In [4]:
# Load Data -> Document -> Divide text into chunks -> Text to Vector using OpenAI Embeddings -> Store in Vector DB
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
documents = text_splitter.split_documents(docs)
documents

[Document(metadata={'source': 'https://docs.smith.langchain.com/administration/tutorials/manage_spend', 'title': 'Optimize tracing spend on LangSmith | 🦜️🛠️ LangSmith', 'description': 'Before diving into this content, it might be helpful to read the following:', 'language': 'en'}, page_content='Optimize tracing spend on LangSmith | 🦜️🛠️ LangSmith\n\n\n\n\n\n\nSkip to main contentWe are growing and hiring for multiple roles for LangChain, LangGraph and LangSmith. Join our team!API ReferenceRESTPythonJS/TSSearchRegionUSEUGo to AppGet StartedObservabilityEvaluationPrompt EngineeringDeployment (LangGraph Platform)AdministrationTutorialsOptimize tracing spend on LangSmithHow-to GuidesSetupConceptual GuideSelf-hostingPricingReferenceCloud architecture and scalabilityAuthz and AuthnAuthentication methodsdata_formatsEvaluationDataset transformationsRegions FAQsdk_referenceAdministrationTutorialsOptimize tracing spend on LangSmithOn this pageOptimize tracing spend on LangSmith\nRecommended Read

In [5]:
# Embedding Technique -> OpenAI Embeddings
from langchain_openai.embeddings import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-small", chunk_size=1)

In [6]:
# Store in Vector DB -> Faiss DB
from langchain_community.vectorstores import FAISS

vectorstore = FAISS.from_documents(documents, embeddings)

In [11]:
# Query the Vector DB -> Similarity Search
query = "LangSmith has two usage limits: total traces and extended"
result = vectorstore.similarity_search(query)
result[0].page_content

'Lets start by setting limits on our production usage, since that is where the majority of spend comes from.\nSetting a good total traces limit\u200b\nPicking the right "total traces" limit depends on the expected load of traces that you will send to LangSmith. You should\nclearly think about your assumptions before setting a limit.\nFor example:\n\nCurrent Load: Our gen AI application is called between 1.2-1.5 times per second, and each API request has a trace associated with it,\nmeaning we log around 100,000-130,000 traces per day\nExpected Growth in Load: We expect to double in size in the near future.\n\nFrom these assumptions, we can do a quick back-of-the-envelope calculation to get a good limit of:\nlimit = current_load_per_day * expected_growth * days/month      = 130,000 * 2 * 30      = 7,800,000 traces / month\nWe click on the edit icon on the right side of the table for our Prod row, and can enter this limit as follows:'

In [9]:
# load the model
# ChatOpenAI is a wrapper around OpenAI's chat models, including gpt-3.5-turbo and gpt-4-turbo and others.
# The model name is passed as a parameter to the constructor.
from langchain_openai import ChatOpenAI
llm = ChatOpenAI(model='gpt-4.1-nano')
print(llm)

client=<openai.resources.chat.completions.completions.Completions object at 0x00000169A74A1520> async_client=<openai.resources.chat.completions.completions.AsyncCompletions object at 0x00000169A74A1A90> root_client=<openai.OpenAI object at 0x00000169A742E390> root_async_client=<openai.AsyncOpenAI object at 0x00000169A74A3EC0> model_name='gpt-4.1-nano' model_kwargs={} openai_api_key=SecretStr('**********')


In [None]:
# Import the chaining logic to combine retrieved documents with an LLM
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_core.prompts import ChatPromptTemplate

# Define a prompt template for answering the user's question using only the given context.
# This prompt is used to "stuff" all context into a single message and ask the LLM to answer based on it.
prompt = ChatPromptTemplate.from_template(
    """
Answer the following question based only on provided context:
<context>
{context}
<context>
"""
)

# Create a document chain using the LLM and the above prompt.
# This chain expects documents (as context) and uses the LLM to generate an answer.
document_chain = create_stuff_documents_chain(llm, prompt)
document_chain

RunnableBinding(bound=RunnableBinding(bound=RunnableAssign(mapper={
  context: RunnableLambda(format_docs)
}), kwargs={}, config={'run_name': 'format_inputs'}, config_factories=[])
| ChatPromptTemplate(input_variables=['context'], input_types={}, partial_variables={}, messages=[HumanMessagePromptTemplate(prompt=PromptTemplate(input_variables=['context'], input_types={}, partial_variables={}, template='\nAnswer the following question based only on provided context:\n<context>\n{context}\n<context>\n'), additional_kwargs={})])
| ChatOpenAI(client=<openai.resources.chat.completions.completions.Completions object at 0x00000169A74A1520>, async_client=<openai.resources.chat.completions.completions.AsyncCompletions object at 0x00000169A74A1A90>, root_client=<openai.OpenAI object at 0x00000169A742E390>, root_async_client=<openai.AsyncOpenAI object at 0x00000169A74A3EC0>, model_name='gpt-4.1-nano', model_kwargs={}, openai_api_key=SecretStr('**********'))
| StrOutputParser(), kwargs={}, config={

In [None]:
# Import LangChain's Document type to manually construct input context.
from langchain_core.documents import Document

# Manually invoke the document_chain by passing a user question and a list of documents (context).
# This is useful for testing how the LLM responds when directly given contextual chunks without retrieval.
document_chain.invoke({
    "input":"LangSmith has two usage limits: total traces and extended",
    "context": [Document(page_content="LangSmith has two usage limits: total traces and extended retention traces. These correspond to the two metrics we've been tracking on our usage graph. We can use these in tandem to have granular control over spend.")]
})

'The two usage limits for LangSmith are total traces and extended retention traces. These limits allow for granular control over spending by monitoring and managing these two metrics simultaneously.'

In [None]:
# Convert your existing vector database into a retriever object.
# A retriever is a standardized interface that lets you fetch relevant documents from a Vector Store based on user queries.
retriever = vectorstore.as_retriever()

# Import and create a retrieval chain using retriever and document_chain.
from langchain.chains import create_retrieval_chain

# This is the main RAG chain — it retrieves relevant documents for a query, then passes them into the document_chain to get a final answer.
retrieval_chain = create_retrieval_chain(retriever, document_chain)
retrieval_chain

RunnableBinding(bound=RunnableAssign(mapper={
  context: RunnableBinding(bound=RunnableLambda(lambda x: x['input'])
           | VectorStoreRetriever(tags=['FAISS', 'OpenAIEmbeddings'], vectorstore=<langchain_community.vectorstores.faiss.FAISS object at 0x000001694B0710D0>, search_kwargs={}), kwargs={}, config={'run_name': 'retrieve_documents'}, config_factories=[])
})
| RunnableAssign(mapper={
    answer: RunnableBinding(bound=RunnableBinding(bound=RunnableAssign(mapper={
              context: RunnableLambda(format_docs)
            }), kwargs={}, config={'run_name': 'format_inputs'}, config_factories=[])
            | ChatPromptTemplate(input_variables=['context'], input_types={}, partial_variables={}, messages=[HumanMessagePromptTemplate(prompt=PromptTemplate(input_variables=['context'], input_types={}, partial_variables={}, template='\nAnswer the following question based only on provided context:\n<context>\n{context}\n<context>\n'), additional_kwargs={})])
            | ChatOpenA

In [None]:
# Invoke the retrieval chain with a natural language query.
# Behind the scenes, this will:
#   1. Use retriever to get the most relevant documents (by semantic search over vector embeddings)
#   2. Use document_chain to send those documents + query to the LLM
#   3. Return the LLM-generated response
response = retrieval_chain.invoke({
    "input": "LangSmith has two usage limits: total traces and extended"
})
response

{'input': 'LangSmith has two usage limits: total traces and extended',
 'context': [Document(id='dfb79a7c-41b8-4f54-beb3-4cf449898523', metadata={'source': 'https://docs.smith.langchain.com/administration/tutorials/manage_spend', 'title': 'Optimize tracing spend on LangSmith | 🦜️🛠️ LangSmith', 'description': 'Before diving into this content, it might be helpful to read the following:', 'language': 'en'}, page_content='Lets start by setting limits on our production usage, since that is where the majority of spend comes from.\nSetting a good total traces limit\u200b\nPicking the right "total traces" limit depends on the expected load of traces that you will send to LangSmith. You should\nclearly think about your assumptions before setting a limit.\nFor example:\n\nCurrent Load: Our gen AI application is called between 1.2-1.5 times per second, and each API request has a trace associated with it,\nmeaning we log around 100,000-130,000 traces per day\nExpected Growth in Load: We expect to 

In [21]:
# Print the response in a paragraph format
print(response['answer'])

Based on the provided context, here are key points regarding setting and managing production usage limits with LangSmith:

1. **Setting Total Traces Limits:**  
   - The limit should be based on expected load and growth.  
   - Example calculation: Current daily traces (e.g., 130,000) multiplied by expected growth (e.g., 2x) and days per month (e.g., 30 days) gives an estimated monthly limit (e.g., 7,800,000 traces/month).  
   - This limit can be set by editing the Prod row in the interface.

2. **Consumption and Cost Management:**  
   - Usage is tracked per workspace, which often correspond to development environments or teams.  
   - To cut costs, focus first on the workspace responsible for the majority of usage.

3. **Data Retention and Sampling:**  
   - LangSmith offers extended data retention for 400 days, which can be selectively applied using server-side sampling (e.g., sampling 10% of runs).  
   - Sampling helps balance cost with the need for historical debugging data.

4.