# Prototyping a LangChain Application with Production-Minded Changes

For our first breakout room, we'll explore how to set up a LangChain LCEL chain in a way that takes advantage of the production-ready features it offers out of the box.

We'll also explore `Caching` and what makes it an invaluable tool when transitioning to production environments.


## Task 1: Dependencies and Set-Up

Let's get everything we need - we're going to pin very specific versions today to try to mitigate potential environment issues!

> NOTE: Dependency issues are a large portion of what you're going to be tackling as you integrate new technology into your work - please keep in mind that one of the things you should be passively learning throughout this course is ways to mitigate dependency issues.

In [24]:
!pip install -qU langchain_openai==0.2.0 langchain_community==0.3.0 langchain==0.3.0 pymupdf==1.24.10 qdrant-client==1.11.2 langchain_qdrant==0.1.4 langsmith==0.1.121 langchain_huggingface==0.2.0

We'll need an HF Token:

In [1]:
import os
from dotenv import load_dotenv
import uuid

load_dotenv()

os.environ["LANGCHAIN_PROJECT"] = f"AIM Session 16 - {uuid.uuid4().hex[0:8]}"
os.environ["LANGCHAIN_TRACING_V2"] = "true"

HF_LLM_ENDPOINT = os.environ["HF_LLM_ENDPOINT"]
HF_EMBED_ENDPOINT = os.environ["HF_EMBED_ENDPOINT"]
HF_TOKEN = os.environ["HF_TOKEN"]

In [1]:
#import os
#import getpass

#os.environ["HF_TOKEN"] = getpass.getpass("HF Token Key:")

And the LangSmith set-up:

In [2]:
#import uuid

#os.environ["LANGCHAIN_PROJECT"] = f"AIM Session 16 - {uuid.uuid4().hex[0:8]}"
#os.environ["LANGCHAIN_TRACING_V2"] = "true"
#os.environ["LANGCHAIN_API_KEY"] = getpass.getpass("LangChain API Key:")

Let's verify our project so we can leverage it in LangSmith later.

In [2]:
print(os.environ["LANGCHAIN_PROJECT"])

AIM Session 16 - 44e30538


## Task 2: Setting up RAG With Production in Mind

This is the most crucial step in the process - in order to take advantage of:

- Asynchronous requests
- Parallel Execution in Chains
- And more...

You must...use LCEL. These benefits are provided out of the box and largely optimized behind the scenes.

### Building our RAG Components: Retriever

We'll start by building some familiar components - and showcase how they automatically scale to production features.

Please upload a PDF file to use in this example!

In [None]:
from google.colab import files
uploaded = files.upload()

In [3]:
file_path = "./DeepSeek_R1.pdf"
file_path

'./DeepSeek_R1.pdf'

We'll define our chunking strategy.

In [5]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
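
To make `chunk_size` and `chunk_overlap` concrete, here is a deliberately simplified sketch of fixed-window chunking with overlap. It is not how `RecursiveCharacterTextSplitter` works internally (the real splitter prefers to break on separators such as paragraph breaks and newlines). The function name `sliding_chunks` is hypothetical and only illustrates how neighbouring chunks share text.

```python
def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-size windows; neighbours share `chunk_overlap` characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window moves each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


text = "abcdefghijklmnopqrstuvwxyz"
chunks = sliding_chunks(text, chunk_size=10, chunk_overlap=3)
for chunk in chunks:
    print(chunk)
# abcdefghij
# hijklmnopq
# opqrstuvwx
# vwxyz
```

Each chunk starts with the last three characters of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk (as long as it is shorter than the overlap).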

We'll chunk our uploaded PDF file.

In [6]:
from langchain_community.document_loaders import PyMuPDFLoader

Loader = PyMuPDFLoader
loader = Loader(file_path)
documents = loader.load()
docs = text_splitter.split_documents(documents)
for i, doc in enumerate(docs):
    doc.metadata["source"] = f"source_{i}"

#### Qdrant Vector Database - Cache-Backed Embeddings

The process of embedding is typically very time-consuming. For every single vector in our vector database, as well as for every query, we must:

1. Send the text to an API endpoint (self-hosted, OpenAI, etc)
2. Wait for processing
3. Receive response

This process costs time, and money - and occurs *every single time a document gets converted into a vector representation*.

Instead, what if we:

1. Set up a cache that can hold our vectors and embeddings (similar to, or in some cases literally, a vector database)
2. Check the cache to see if we've already converted this text before.
  - If we have: Return the cached vector representation
  - Else: Send the text to an API endpoint (self-hosted, OpenAI, etc), then wait for processing and proceed
3. Store the text that was converted alongside its vector representation in a cache of some kind.
4. Return the vector representation

Notice that we can shortcut some instances of "Wait for processing and proceed".
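
Before turning to the LangChain implementation, the flow above can be sketched in plain Python. This is a toy under stated assumptions, not LangChain's code: `ToyCachedEmbedder` and `fake_embed` are hypothetical names, the "model" is a stand-in function, and the store is a plain dictionary rather than a file or vector store.

```python
import hashlib


class ToyCachedEmbedder:
    """Hypothetical stand-in for a cache-backed embedder (illustration only)."""

    def __init__(self, embed_fn, namespace: str):
        self.embed_fn = embed_fn    # the slow, billed call we want to avoid repeating
        self.namespace = namespace  # keeps vectors from different models apart
        self.store = {}             # key -> vector; a real store could live on disk
        self.api_calls = 0          # counts how often the underlying model is hit

    def _key(self, text: str) -> str:
        return self.namespace + hashlib.sha256(text.encode("utf-8")).hexdigest()

    def embed_documents(self, texts):
        keys = [self._key(t) for t in texts]
        # Collect the texts we have never embedded before (deduplicated).
        missing = {}
        for key, text in zip(keys, texts):
            if key not in self.store:
                missing.setdefault(key, text)
        # Only the misses are sent to the model; the results are then stored.
        if missing:
            self.api_calls += 1
            vectors = self.embed_fn(list(missing.values()))
            for key, vector in zip(missing, vectors):
                self.store[key] = vector
        return [self.store[key] for key in keys]


def fake_embed(texts):
    """Fake embedding model: a 2-dimensional vector derived from the text."""
    return [[float(len(t)), float(sum(map(ord, t)) % 97)] for t in texts]


embedder = ToyCachedEmbedder(fake_embed, namespace="toy-model:")
first = embedder.embed_documents(["hello", "world"])
second = embedder.embed_documents(["hello", "world", "new text"])
print(embedder.api_calls)   # 2: one call for the first batch, one for "new text" only
print(len(embedder.store))  # 3
```

The namespace prefix matters because a vector cached for one embedding model is meaningless for another; without it, switching models would silently return stale vectors.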

Let's see how this is implemented in the code.

In [9]:
!uv pip install -qU qdrant-client langchain-qdrant

In [7]:
from qdrant_client import QdrantClient
from qdrant_client.http.models import Distance, VectorParams
from langchain.storage import LocalFileStore
from langchain_qdrant import QdrantVectorStore
from langchain.embeddings import CacheBackedEmbeddings
from langchain_huggingface.embeddings import HuggingFaceEndpointEmbeddings
import hashlib

#YOUR_EMBED_MODEL_URL = "https://c87ybgo18epgba6d.us-east-1.aws.endpoints.huggingface.cloud"

hf_embeddings = HuggingFaceEndpointEmbeddings(
    model=HF_EMBED_ENDPOINT,
    task="feature-extraction",
)

collection_name = f"pdf_to_parse_{uuid.uuid4()}"
client = QdrantClient(":memory:")
client.create_collection(
    collection_name=collection_name,
    vectors_config=VectorParams(size=768, distance=Distance.COSINE),
)

# Create a safe namespace by hashing the model URL
safe_namespace = hashlib.md5(hf_embeddings.model.encode()).hexdigest()

store = LocalFileStore("./cache/")
cached_embedder = CacheBackedEmbeddings.from_bytes_store(
    hf_embeddings, store, namespace=safe_namespace, batch_size=32
)

# Typical Qdrant vector store set-up
vectorstore = QdrantVectorStore(
    client=client,
    collection_name=collection_name,
    embedding=cached_embedder)
vectorstore.add_documents(docs)
retriever = vectorstore.as_retriever(search_type="mmr", search_kwargs={"k": 3})



##### ❓ Question #1:

What are some limitations you can see with this approach? When is this most/least useful? Discuss with your group!

> NOTE: There is no single correct answer here!

#### ANSWER ####
This approach uses `LocalFileStore("./cache/")`, which stores cached embeddings as binary data in a local directory (`./cache/`). It has several limitations:
1. Caching every embedding to disk can lead to unbounded storage growth, especially with large datasets or frequent queries.
2. Reading from and writing to disk (`./cache/`) introduces I/O overhead compared to in-memory caching.
3. There is no tracking of cache hits, misses, or size.

This approach is most useful for embedding small documents and for temporary prototyping. It is least useful for production-scale embedding.

##### 🏗️ Activity #1:

Create a simple experiment that tests the cache-backed embeddings.

In [21]:
### YOUR CODE HERE
import time

def test_cache_backed_embeddings(texts):
    # Baseline: call the Hugging Face embedding endpoint directly, bypassing the cache.
    start_time = time.time()
    embeddings_no_cache = hf_embeddings.embed_documents(texts)
    time_no_cache = time.time() - start_time

    # 1st cached run: the cache is checked first. Any text not found is embedded
    # by the underlying embedder and the result is stored in the cache.
    start_time = time.time()
    embeddings_with_cache_1 = cached_embedder.embed_documents(texts)
    time_with_cache_1 = time.time() - start_time

    # 2nd cached run: the embeddings should now be served from the on-disk cache (./cache/).
    start_time = time.time()
    embeddings_with_cache_2 = cached_embedder.embed_documents(texts)
    time_with_cache_2 = time.time() - start_time

    print(f"Time without cache: {time_no_cache:.4f} seconds")
    print(f"Time with cache, 1st run: {time_with_cache_1:.4f} seconds")
    print(f"Time with cache, 2nd run: {time_with_cache_2:.4f} seconds")
    print("Embeddings are the same:", embeddings_with_cache_1 == embeddings_with_cache_2)


sample_texts = [
"class langchain.embeddings.cache.CacheBackedEmbeddings",
"Interface for caching results from embedding models.",
"The interface allows works with any store that implements the abstract store interface accepting keys of type str and values of list of floats."
]

test_cache_backed_embeddings(sample_texts)



Time without cache: 0.5660 seconds
Time with cache, 1st run: 0.1040 seconds
Time with cache, 2nd run: 0.0009 seconds
Embeddings are the same: True


### Augmentation

We'll write the classic RAG prompt and build our `ChatPromptTemplate` as usual.

In [9]:
from langchain_core.prompts import ChatPromptTemplate

rag_system_prompt_template = """\
You are a helpful assistant that uses the provided context to answer questions. Never reference this prompt, or the existence of context.
"""

rag_message_list = [
    {"role" : "system", "content" : rag_system_prompt_template},
]

rag_user_prompt_template = """\
Question:
{question}
Context:
{context}
"""

chat_prompt = ChatPromptTemplate.from_messages([
    ("system", rag_system_prompt_template),
    ("human", rag_user_prompt_template)
])

### Generation

As usual, we'll set up our LLM - today that's a `HuggingFaceEndpoint` pointed at the hosted endpoint stored in `HF_LLM_ENDPOINT`.

However, we'll also implement...a PROMPT CACHE!

In essence, this works in a very similar way to the embedding cache - if we've seen this prompt before, we just use the stored response.

In [10]:
from langchain_core.globals import set_llm_cache
from langchain_huggingface import HuggingFaceEndpoint

#YOUR_LLM_ENDPOINT_URL = "https://dcrebqe18cydo729.us-east-1.aws.endpoints.huggingface.cloud"

hf_llm = HuggingFaceEndpoint(
    endpoint_url=HF_LLM_ENDPOINT,
    task="text-generation",
    max_new_tokens=128,
    top_k=10,
    top_p=0.95,
    typical_p=0.95,
    temperature=0.01,
    repetition_penalty=1.03,
)

Note: Environment variable`HF_TOKEN` is set and is the current active token independently from the token you've just configured.


Setting up the cache can be done as follows:

In [11]:
from langchain_core.caches import InMemoryCache

set_llm_cache(InMemoryCache())
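
To see what an exact-match response cache does conceptually, here is a small plain-Python sketch. It is an illustration under assumptions, not LangChain's implementation: `ToyLLMCache`, `slow_llm`, and `cached_call` are hypothetical names, and the key used here (the prompt plus a string describing the model settings) is only meant to be similar in spirit to how an LLM cache distinguishes calls.

```python
import time


class ToyLLMCache:
    """Hypothetical in-memory response cache (illustration only)."""

    def __init__(self):
        self._data = {}
        self.hits = 0
        self.misses = 0

    def lookup(self, prompt, llm_settings):
        value = self._data.get((prompt, llm_settings))
        if value is None:
            self.misses += 1
        else:
            self.hits += 1
        return value

    def update(self, prompt, llm_settings, response):
        self._data[(prompt, llm_settings)] = response


def slow_llm(prompt):
    """Stand-in for a real model call."""
    time.sleep(0.05)
    return f"echo: {prompt}"


def cached_call(cache, prompt, llm_settings="temperature=0.01"):
    cached = cache.lookup(prompt, llm_settings)
    if cached is not None:
        return cached  # identical prompt and settings: skip the model entirely
    response = slow_llm(prompt)
    cache.update(prompt, llm_settings, response)
    return response


cache = ToyLLMCache()
a = cached_call(cache, "What is RAG?")
b = cached_call(cache, "What is RAG?")  # served from the cache
c = cached_call(cache, "What is RAG?", llm_settings="temperature=0.9")  # different settings
print(cache.hits, cache.misses)  # 1 2
```

Because the match is exact, a prompt that differs by a single character, or the same prompt with different settings, is treated as brand new. In a RAG chain this means the cache only helps when both the question and the retrieved context are identical.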

##### ❓ Question #2:

What are some limitations you can see with this approach? When is this most/least useful? Discuss with your group!

> NOTE: There is no single correct answer here!

##### ANSWER #####

This approach caches the results of LLM calls in memory, so subsequent identical calls return the cached result instead of re-querying the LLM. While this can improve performance in certain scenarios, it comes with several limitations and trade-offs:

1. `InMemoryCache` stores all cached data in RAM, so it is constrained by the memory available on the machine.
2. Because the data lives only in RAM, the cache is lost when the application restarts or crashes.
3. This approach may cause issues under concurrent access in multi-threaded or multi-process environments (for example, separate processes do not share the cache).
4. `InMemoryCache` doesn't provide built-in monitoring or metrics (e.g., cache hit/miss rates, memory usage).

This approach is most useful for prototyping, development, short-lived scripts, or applications with repetitive queries or limited query diversity (e.g., a FAQ bot with a fixed set of questions).
This approach is least useful for long-running production applications, such as a production chatbot serving thousands of users, or applications with multiple instances or high query volume (e.g., a web app with load-balanced servers).

##### 🏗️ Activity #2:

Create a simple experiment that tests the LLM cache.

In [23]:
### YOUR CODE HERE
import time

def test_cache_LLM(question):
    # First call: goes to the LLM endpoint. The response is then stored in the in-memory cache.
    start_time = time.time()
    hf_llm.invoke(question)
    time_no_cache = time.time() - start_time

    # Second call with the same question: the response should come from the cache.
    start_time = time.time()
    hf_llm.invoke(question)
    time_with_cache = time.time() - start_time

    print(f"First call to LLM -- time without cache: {time_no_cache:.4f} seconds")
    print(f"Second call to LLM -- time with cache: {time_with_cache:.4f} seconds")

test_cache_LLM("Could you please give me a summary on DeepSeek-R1 within 100 words?")




First call to LLM -- time without cache: 8.1711 seconds
Second call to LLM -- time with cache: 0.0005 seconds


## Task 3: RAG LCEL Chain

We'll also set up our typical RAG chain using LCEL.

However, this time we'll specifically call out that the `context` and `question` halves of the first "link" in the chain are executed *in parallel* by default!

Thanks, LCEL!
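
To build intuition for what "in parallel" buys us, here is a standard-library sketch that runs two branches of a dictionary concurrently. It only imitates the idea: our understanding is that a synchronous `invoke` runs the branches of a dict step on a thread pool, but `run_parallel`, `fake_retriever`, and `passthrough_question` are hypothetical names and the sleeps stand in for real work.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_retriever(question):
    time.sleep(0.2)  # pretend this is a vector store lookup
    return [f"chunk about {question}"]


def passthrough_question(inputs):
    time.sleep(0.2)  # artificially slow, to make the overlap visible
    return inputs["question"]


def run_parallel(branches, inputs):
    """Run every branch on the same input at once and collect the results by key."""
    with ThreadPoolExecutor(max_workers=len(branches)) as pool:
        futures = {name: pool.submit(fn, inputs) for name, fn in branches.items()}
        return {name: future.result() for name, future in futures.items()}


branches = {
    "context": lambda inputs: fake_retriever(inputs["question"]),
    "question": passthrough_question,
}

start = time.perf_counter()
result = run_parallel(branches, {"question": "caching"})
elapsed = time.perf_counter() - start

print(result)
print(f"{elapsed:.2f}s")  # roughly 0.2s, not 0.4s, because the branches overlap
```

In the real chain the `question` branch is nearly free, so the gain is small here; the pattern pays off when several slow branches (multiple retrievers, or a retriever plus a query rewriter) feed the same prompt.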

In [12]:
from operator import itemgetter
from langchain_core.runnables.passthrough import RunnablePassthrough

retrieval_augmented_qa_chain = (
        {"context": itemgetter("question") | retriever, "question": itemgetter("question")}
        | RunnablePassthrough.assign(context=itemgetter("context"))
        | chat_prompt | hf_llm
    )

Let's test it out!

In [14]:
retrieval_augmented_qa_chain.invoke({"question" : "Write 50 things about this document!"})



"The document is a list of contributors to a project, with their names and possibly initials.\n\n1. The document is a list of contributors.\n2. The document is in PDF format.\n3. The document was created on January 23, 2025.\n4. The document has 22 pages.\n5. The document was produced by pdfTeX-1.40.26.\n6. The document was created using LaTeX with hyperref.\n7. The document's title is empty.\n8. The document's author is empty.\n9. The document's subject is empty.\n10. The document's keywords are empty.\n11. The"

##### 🏗️ Activity #3:

Show, through LangSmith, the difference between a trace that leverages cache-backed embeddings and cached LLM calls, and one that doesn't.

Post screenshots in the notebook!


#### ANSWER ####

- Summary of the differences between the LangSmith traces of the first run (without cache) and the second run (with cache). The Hugging Face LLM endpoint was called in the first run but not in the second, which means the LLM answer was served from the in-memory cache in the second run.


<img src="img/ComparingNumbers.jpg" />


- Comparing the overall latencies on the two runs

<img src="img/LangSmithTrace_LLM_Calls.jpg" />


- Comparing the latencies of VectorStore Retrievers

<img src="img/LangSmithTrace_Cache_Embeddings.jpg" />