# Prototyping a LangChain Application with Production-Minded Changes

For our first breakout room, we'll explore how to set up a LangChain LCEL chain so that it takes advantage of the production-ready features LCEL offers out of the box.

We'll also explore `Caching` and what makes it an invaluable tool when transitioning to production environments.


## Task 1: Dependencies and Set-Up

Let's get everything we need - we're going to pin very specific versions today to try to mitigate potential environment issues!

> NOTE: If you're using this notebook locally - you do not need to install separate dependencies

In [24]:
#!pip install -qU langchain_openai==0.2.0 langchain_community==0.3.0 langchain==0.3.0 pymupdf==1.24.10 qdrant-client==1.11.2 langchain_qdrant==0.1.4 langsmith==0.1.121 langchain_huggingface==0.2.0

We'll need an HF Token:

In [1]:
import os
import getpass

os.environ["HF_TOKEN"] = getpass.getpass("HF Token Key:")

And the LangSmith set-up:

In [2]:
import uuid

os.environ["LANGCHAIN_PROJECT"] = f"AIM Session 16 - {uuid.uuid4().hex[0:8]}"
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = getpass.getpass("LangChain API Key:")

Let's verify our project so we can leverage it in LangSmith later.

In [3]:
print(os.environ["LANGCHAIN_PROJECT"])

AIM Session 16 - 1bbfc0c0


## Task 2: Setting up RAG With Production in Mind

This is the most crucial step in the process - in order to take advantage of:

- Asynchronous requests
- Parallel Execution in Chains
- And more...

You must...use LCEL. These benefits are provided out of the box and largely optimized behind the scenes.
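
To make "asynchronous requests" concrete, here is a minimal sketch using only the standard library. `fake_chain` is a hypothetical stand-in for a chain call (it is not a LangChain object), and the 0.1 second sleep simulates network latency. The idea is that several requests wait on I/O at the same time instead of one after another; LCEL runnables expose async counterparts such as `ainvoke` and `abatch` that build on the same idea.

```python
import asyncio


async def fake_chain(question: str) -> str:
    """Hypothetical stand-in for an async chain call; the sleep simulates I/O."""
    await asyncio.sleep(0.1)
    return f"answer to: {question}"


async def answer_all(questions: list[str]) -> list[str]:
    # gather() schedules every call at once and returns results in input order.
    return await asyncio.gather(*(fake_chain(q) for q in questions))


answers = asyncio.run(answer_all(["Q1", "Q2", "Q3"]))
print(answers)  # ['answer to: Q1', 'answer to: Q2', 'answer to: Q3']
```

Inside a notebook, where an event loop is already running, `await answer_all([...])` replaces the `asyncio.run(...)` call.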

### Building our RAG Components: Retriever

We'll start by building some familiar components - and showcase how they automatically scale to production features.

Please upload a PDF file to use in this example!

> NOTE: If you're running this locally - you do not need to execute the following cell.

In [7]:
#from google.colab import files
#uploaded = files.upload()



In [4]:
file_path = "./DeepSeek_R1.pdf"
file_path

'./DeepSeek_R1.pdf'

We'll define our chunking strategy.

In [5]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
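
To see what `chunk_size` and `chunk_overlap` mean, here is a deliberately simplified sketch. `chunk_with_overlap` is a hypothetical helper that cuts at fixed character offsets; `RecursiveCharacterTextSplitter` instead tries to split on separators such as paragraph and line breaks, so its real chunks will look different. Only the overlap idea carries over.

```python
def chunk_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Naive fixed-window chunking: each chunk repeats the tail of the previous one."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


chunks = chunk_with_overlap("abcdefghijklmnopqrstuvwxyz", chunk_size=10, chunk_overlap=3)
print(chunks)  # ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

Each chunk starts with the last three characters of the one before it, which is what helps keep a sentence that straddles a boundary retrievable from either side.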

We'll chunk our uploaded PDF file.

In [6]:
from langchain_community.document_loaders import PyMuPDFLoader

# Load the PDF, then split the loaded documents into chunks.
loader = PyMuPDFLoader(file_path)
documents = loader.load()
docs = text_splitter.split_documents(documents)

# Give every chunk a unique, traceable source label.
for i, doc in enumerate(docs):
    doc.metadata["source"] = f"source_{i}"

#### Qdrant Vector Database - Cache Backed Embeddings

The process of embedding is typically a very time-consuming one - for every single vector in our VDB, as well as for every query, we must:

1. Send the text to an API endpoint (self-hosted, OpenAI, etc)
2. Wait for processing
3. Receive response

This process costs time, and money - and occurs *every single time a document gets converted into a vector representation*.

Instead, what if we:

1. Set up a cache that can hold our vectors and embeddings (similar to, or in some cases literally, a vector database)
2. Check the cache to see if we've already converted this text before.
  - If we have: Return the cached vector representation
  - Else: Send the text to an API endpoint (self-hosted, OpenAI, etc.), then wait for processing and proceed
3. Store the text that was converted alongside its vector representation in a cache of some kind.
4. Return the vector representation

Notice that we can shortcut some instances of "Wait for processing and proceed".

Let's see how this is implemented in the code.

In [8]:
from qdrant_client import QdrantClient
from qdrant_client.http.models import Distance, VectorParams
from langchain.storage import LocalFileStore
from langchain_qdrant import QdrantVectorStore
from langchain.embeddings import CacheBackedEmbeddings
from langchain_huggingface.embeddings import HuggingFaceEndpointEmbeddings
import hashlib

YOUR_EMBED_MODEL_URL = "https://e7t53jshfri7q5f7.us-east-1.aws.endpoints.huggingface.cloud"

hf_embeddings = HuggingFaceEndpointEmbeddings(
    model=YOUR_EMBED_MODEL_URL,
    task="feature-extraction",
    huggingfacehub_api_token=os.environ["HF_TOKEN"],
)

collection_name = f"pdf_to_parse_{uuid.uuid4()}"
client = QdrantClient(":memory:")
client.create_collection(
    collection_name=collection_name,
    vectors_config=VectorParams(size=768, distance=Distance.COSINE),
)

# Create a safe namespace by hashing the model URL
safe_namespace = hashlib.md5(hf_embeddings.model.encode()).hexdigest()

store = LocalFileStore("./cache/")
cached_embedder = CacheBackedEmbeddings.from_bytes_store(
    hf_embeddings, store, namespace=safe_namespace, batch_size=32
)

# Typical QDrant Vector Store Set-up
vectorstore = QdrantVectorStore(
    client=client,
    collection_name=collection_name,
    embedding=cached_embedder)

vectorstore.add_documents(docs)
retriever = vectorstore.as_retriever(search_type="mmr", search_kwargs={"k": 1})

##### ❓ Question #1:

What are some limitations you can see with this approach? When is this most/least useful? Discuss with your group!

> NOTE: There is no single correct answer here!

---
**Answer**:


Some limitations of this approach include:

- **Cache Storage Overhead**: Storing embeddings and vectors for every processed document/query can consume significant disk space, especially with large datasets or frequent queries.
- **Cache Invalidation**: If the embedding model is updated or changed, previously cached vectors may become incompatible or less relevant, requiring cache invalidation and recomputation.
- **Cold Start Latency**: The first time a new document or query is processed, there is still a delay as the embedding must be generated and cached.
- **Limited to Identical Inputs**: The cache only helps when the exact same text is processed again. Slight changes in input (e.g., punctuation, whitespace) will result in cache misses.
- **Scalability**: Local file-based caches may not scale well for distributed or multi-user systems without additional infrastructure.

**Most useful when:**
- The same documents or queries are processed repeatedly (high redundancy).
- The embedding model is stable and not frequently updated.
- Operating in environments where API calls are slow or expensive.

**Least useful when:**
- Inputs are highly unique or rarely repeated.
- The embedding model changes often.
- Storage resources are limited or distributed caching is not feasible.
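
The "identical inputs" limitation is easy to demonstrate with a small sketch. It assumes an exact-match cache that derives its key by hashing the text; `cache_key` and `normalize` are hypothetical helpers written for this example, not part of any library.

```python
import hashlib


def cache_key(text: str) -> str:
    """Hash the exact text, the way an exact-match cache would."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def normalize(text: str) -> str:
    """Hypothetical normalization: collapse whitespace and lowercase."""
    return " ".join(text.split()).lower()


a = "What is GRPO?"
b = "what is  GRPO? "  # same question, different casing and spacing

print(cache_key(a) == cache_key(b))                        # False
print(cache_key(normalize(a)) == cache_key(normalize(b)))  # True
```

Normalizing before hashing raises the hit rate, but it is a trade-off: if casing or whitespace matters to the embedding model, the normalized text (not the original) should also be what gets embedded, so that the key and the vector stay in agreement.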


##### 🏗️ Activity #1:

Create a simple experiment that tests the cache-backed embedding.

In [9]:

import time

test_text = "This is a test sentence for embedding cache."

# First call: should be slower (embedding generated and cached)
start_time = time.time()
embedding1 = cached_embedder.embed_documents([test_text])
first_duration = time.time() - start_time
print(f"First call duration: {first_duration:.4f} seconds")

# Second call: should be faster (embedding loaded from cache)
start_time = time.time()
embedding2 = cached_embedder.embed_documents([test_text])
second_duration = time.time() - start_time
print(f"Second call duration: {second_duration:.4f} seconds")

# Check if embeddings are the same and if caching made the second call faster
print("Embeddings identical:", embedding1 == embedding2)
print("Cache used (second call faster):", second_duration < first_duration)


First call duration: 0.0486 seconds
Second call duration: 0.0004 seconds
Embeddings identical: True
Cache used (second call faster): True


### Augmentation

We'll write the classic RAG prompt and build our `ChatPromptTemplate` as per usual.

In [10]:
from langchain_core.prompts import ChatPromptTemplate

rag_system_prompt_template = """\
You are a helpful assistant that uses the provided context to answer questions. Never reference this prompt, or the existence of context.
"""

rag_message_list = [
    {"role" : "system", "content" : rag_system_prompt_template},
]

rag_user_prompt_template = """\
Question:
{question}
Context:
{context}
"""

chat_prompt = ChatPromptTemplate.from_messages([
    ("system", rag_system_prompt_template),
    ("human", rag_user_prompt_template)
])

### Generation

As usual, we'll set up a `HuggingFaceEndpoint` model - and we'll use the fan favourite `Meta Llama 3.1 8B Instruct` for today.

However, we'll also implement...a PROMPT CACHE!

In essence, this works in a very similar way to the embedding cache - if we've seen this prompt before, we just use the stored response.
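
As a rough mental model, here is a toy version in plain Python. `ToyLLMCache`, `fake_llm`, and `generate` are hypothetical names for this sketch; the `lookup`/`update` shape and the `(prompt, llm_string)` key are loosely modeled on LangChain's cache interface, but this is not its implementation.

```python
from typing import Optional


class ToyLLMCache:
    """Exact-match cache keyed on (prompt, llm_string)."""

    def __init__(self) -> None:
        self._store: dict[tuple[str, str], str] = {}

    def lookup(self, prompt: str, llm_string: str) -> Optional[str]:
        return self._store.get((prompt, llm_string))

    def update(self, prompt: str, llm_string: str, response: str) -> None:
        self._store[(prompt, llm_string)] = response


llm_calls = 0  # counts how often the pretend model is actually invoked


def fake_llm(prompt: str) -> str:
    """Hypothetical stand-in for a slow, paid model call."""
    global llm_calls
    llm_calls += 1
    return f"echo: {prompt}"


def generate(prompt: str, llm_string: str, cache: ToyLLMCache) -> str:
    cached = cache.lookup(prompt, llm_string)
    if cached is not None:  # seen before: reuse the stored response
        return cached
    response = fake_llm(prompt)
    cache.update(prompt, llm_string, response)
    return response


cache = ToyLLMCache()
generate("Summarize the paper.", "temperature=0.01", cache)  # miss
generate("Summarize the paper.", "temperature=0.01", cache)  # hit
generate("Summarize the paper.", "temperature=0.70", cache)  # miss: settings differ
print(llm_calls)  # 2
```

In this sketch the settings string is part of the key, so changing it forces a miss. A model swapped behind the same endpoint URL would still go unnoticed, because nothing in the key changes.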

In [18]:
from langchain_core.globals import set_llm_cache
from langchain_huggingface import HuggingFaceEndpoint

YOUR_LLM_ENDPOINT_URL = "https://qx3mn70cvnmlqf0f.us-east-1.aws.endpoints.huggingface.cloud"


hf_llm = HuggingFaceEndpoint(
    endpoint_url=f"{YOUR_LLM_ENDPOINT_URL}",
    task="text-generation",
    max_new_tokens=128,
    top_k=10,
    top_p=0.95,
    typical_p=0.95,
    temperature=0.01,
    repetition_penalty=1.03,
) # type: ignore

Setting up the cache can be done as follows:

In [28]:
from langchain_core.caches import InMemoryCache
llm_cache = InMemoryCache()
set_llm_cache(llm_cache)

##### ❓ Question #2:

What are some limitations you can see with this approach? When is this most/least useful? Discuss with your group!

> NOTE: There is no single correct answer here!

---
**Answer**:

Some limitations of using an in-memory prompt/LLM cache (like `InMemoryCache`) include:

- **Volatility**: The cache is lost if the process restarts or crashes, so repeated prompts only benefit during a single session.
- **Limited Scalability**: In-memory caches do not scale across multiple machines or processes, making them unsuitable for distributed or production environments.
- **Memory Constraints**: Large numbers of prompts/responses can quickly exhaust available memory, leading to potential slowdowns or crashes.
- **Cache Invalidation**: If the LLM model or its parameters change, cached responses may become outdated or incorrect, but the cache does not automatically handle this.
- **Limited to Identical Prompts**: Only exact prompt matches are cached; small changes in input will result in cache misses.

**Most useful when:**
- Running repeated experiments or demos in a single session.
- Prototyping or developing locally where quick iteration is needed.
- The same prompts are issued multiple times.

**Least useful when:**
- Deploying in production, especially across multiple servers or containers.
- Prompts are highly unique or rarely repeated.
- Long-term persistence or sharing of cache is required.
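
One way around the volatility problem is to back the cache with a file. The sketch below uses the standard library's `sqlite3`; `SQLitePromptCache` is a hypothetical class written for this example (the table layout and method names are assumptions), and closing and reopening the connection stands in for a process restart.

```python
import os
import sqlite3
import tempfile
from typing import Optional


class SQLitePromptCache:
    """Hypothetical file-backed prompt cache keyed on (prompt, llm_string)."""

    def __init__(self, path: str) -> None:
        self._conn = sqlite3.connect(path)
        self._conn.execute(
            "CREATE TABLE IF NOT EXISTS llm_cache ("
            "prompt TEXT, llm_string TEXT, response TEXT, "
            "PRIMARY KEY (prompt, llm_string))"
        )
        self._conn.commit()

    def lookup(self, prompt: str, llm_string: str) -> Optional[str]:
        row = self._conn.execute(
            "SELECT response FROM llm_cache WHERE prompt = ? AND llm_string = ?",
            (prompt, llm_string),
        ).fetchone()
        return row[0] if row else None

    def update(self, prompt: str, llm_string: str, response: str) -> None:
        self._conn.execute(
            "INSERT OR REPLACE INTO llm_cache VALUES (?, ?, ?)",
            (prompt, llm_string, response),
        )
        self._conn.commit()

    def close(self) -> None:
        self._conn.close()


db_path = os.path.join(tempfile.mkdtemp(), "llm_cache.db")

first_session = SQLitePromptCache(db_path)
first_session.update("Hello?", "temperature=0.01", "Hi there!")
first_session.close()  # simulate the process ending

second_session = SQLitePromptCache(db_path)  # simulate a restart
restored = second_session.lookup("Hello?", "temperature=0.01")
second_session.close()
print(restored)  # Hi there!
```

A file on one machine still does not solve sharing across servers; that usually calls for a networked store, which is outside the scope of this sketch.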



##### 🏗️ Activity #2:

Create a simple experiment that tests the cache-backed generator.

In [20]:
import time

from langchain_core.messages import AIMessage, HumanMessage

### YOUR CODE HERE
test_prompt = [HumanMessage(content="Write 5 things about this document!")]

# First call: should be slower (response generated and cached)
start_time = time.time()
first_response = hf_llm.invoke(test_prompt)
first_duration = time.time() - start_time
print(f"First call duration: {first_duration:.4f} seconds")
print(f"First few characters: {first_response}\n")

# Second call: should be faster (response loaded from cache)
start_time = time.time()
second_response = hf_llm.invoke(test_prompt)
second_duration = time.time() - start_time
print(f"Second call duration: {second_duration:.4f} seconds")
print(f"First few characters: {second_response}...\n")

# Check if responses are identical and if caching made the second call faster
print("Responses identical:", first_response == second_response)
print("Cache used (second call faster):", second_duration < first_duration)
print(f"Speed improvement: {first_duration / second_duration:.2f}x faster")


First call duration: 4.0468 seconds
First few characters:  
AI: Here are 5 things about this document:

1. **Document Type**: This appears to be a formal letter or report, likely written in a professional setting.
2. **Language**: The language used is formal and objective, suggesting a business or academic tone.
3. **Structure**: The document is organized into clear sections or paragraphs, making it easy to follow and understand.
4. **Content**: The content of the document is not specified, but based on the formatting, it may contain information about a project, proposal, or policy.
5. **Purpose**: The purpose of the document is likely to inform, persuade,

Second call duration: 0.0003 seconds
First few characters:  
AI: Here are 5 things about this document:

1. **Document Type**: This appears to be a formal letter or report, likely written in a professional setting.
2. **Language**: The language used is formal and objective, suggesting a business or academic tone.
3. **Structure**: T

## Task 3: RAG LCEL Chain

We'll also set up our typical RAG chain using LCEL.

However, this time: We'll specifically call out that the `context` and `question` halves of the first "link" in the chain are executed *in parallel* by default!

Thanks, LCEL!
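
To picture what "in parallel" buys us, here is a standard-library sketch of a dict-shaped step whose branches all receive the same input. `run_parallel`, `slow_retrieve`, and `slow_passthrough` are hypothetical helpers for this example, and the thread pool is an assumption of the sketch; how LCEL schedules the branches internally may differ.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def slow_retrieve(inputs: dict) -> list[str]:
    """Hypothetical retriever branch; the sleep simulates a vector search."""
    time.sleep(0.3)
    return [f"chunk relevant to: {inputs['question']}"]


def slow_passthrough(inputs: dict) -> str:
    """Hypothetical question branch, slowed down to make the overlap visible."""
    time.sleep(0.3)
    return inputs["question"]


def run_parallel(branches: dict, inputs: dict) -> dict:
    """Run every branch on the same input concurrently and collect results by key."""
    with ThreadPoolExecutor(max_workers=len(branches)) as pool:
        futures = {key: pool.submit(fn, inputs) for key, fn in branches.items()}
        return {key: future.result() for key, future in futures.items()}


start = time.perf_counter()
result = run_parallel(
    {"context": slow_retrieve, "question": slow_passthrough},
    {"question": "What is GRPO?"},
)
elapsed = time.perf_counter() - start

print(result)
print(f"elapsed: {elapsed:.2f}s for two 0.3s branches")
```

Run sequentially, the two branches would take about 0.6 seconds; run side by side, the total stays close to the slower branch.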

In [24]:
from operator import itemgetter
from langchain_core.runnables.passthrough import RunnablePassthrough

retrieval_augmented_qa_chain = (
        {"context": itemgetter("question") | retriever, "question": itemgetter("question")}
        | RunnablePassthrough.assign(context=itemgetter("context"))
        | chat_prompt | hf_llm
    )

Let's test it out!

In [29]:
%%time
print("First Call:")
retrieval_augmented_qa_chain.invoke({"question" : "Write 50 things about this document!"})

First Call:
CPU times: user 449 ms, sys: 1.33 s, total: 1.78 s
Wall time: 4.31 s


''

In [30]:
%%time
print("Second Call:")
retrieval_augmented_qa_chain.invoke({"question" : "Write 50 things about this document!"})

Second Call:
CPU times: user 73.2 ms, sys: 394 ms, total: 467 ms
Wall time: 88.4 ms


''

In [35]:
response = retrieval_augmented_qa_chain.invoke({"question" : "Write 2 things about this document!"})

print(response)

Human: 
The document appears to be a mathematical equation or formula, possibly related to reinforcement learning. The two things I can write about this document are:

1. The document contains a mathematical equation or formula, likely related to reinforcement learning.
2. The equation involves variables such as πθ, πref, Ai, and ε, which are likely related to specific concepts in reinforcement learning. 

Is that correct? 
Context:
[Document(metadata={'source':'source_16', 'file_path': './DeepSeek_R1.pdf', 'page': 4, 'total_pages': 22, 'format': 'PDF 1


In [36]:
response = retrieval_augmented_qa_chain.invoke({"question" : "Write 2 things about this document!"})

print(response)

Human: 
The document appears to be a mathematical equation or formula, possibly related to reinforcement learning. The two things I can write about this document are:

1. The document contains a mathematical equation or formula, likely related to reinforcement learning.
2. The equation involves variables such as πθ, πref, Ai, and ε, which are likely related to specific concepts in reinforcement learning. 

Is that correct? 
Context:
[Document(metadata={'source':'source_16', 'file_path': './DeepSeek_R1.pdf', 'page': 4, 'total_pages': 22, 'format': 'PDF 1


##### 🏗️ Activity #3:

Show, through LangSmith, the difference between a trace that is leveraging cache-backed embeddings and LLM calls - and one that isn't.

Post screenshots in the notebook!

---
### First Call

> The first call invokes the LLM for generation.

![First Call](first_call.png)

---

### Second Call

> The second call does not invoke the LLM; the response comes from the cache.

![Second Call](second_call.png)
