# Prototyping a LangChain Application with Production-Minded Changes

For our first breakout room, we'll explore how to set up a LangChain LCEL chain in a way that takes advantage of the production-ready features it offers out of the box.

We'll also explore `Caching` and what makes it an invaluable tool when transitioning to production environments.

🤝 BREAKOUT ROOM #1:
  - Task 1: Dependencies and Set-Up
  - Task 2: Setting up RAG With Production in Mind
  - Task 3: RAG LCEL Chain



## Task 1: Dependencies and Set-Up

Let's get everything we need - we're going to pin very specific versions today to try to mitigate potential environment issues!

> NOTE: Dependency issues are a large portion of what you're going to be tackling as you integrate new technology into your work - please keep in mind that one of the things you should be passively learning throughout this course is ways to mitigate dependency issues.

In [None]:
!pip install -qU langchain_openai==0.2.0 langchain_community==0.3.0 langchain==0.3.0 pymupdf==1.24.10 qdrant-client==1.11.2 langchain_qdrant==0.1.4 langsmith==0.1.121
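Once the install finishes, it can be worth confirming that the pinned versions are the ones actually present in the environment. The following is a minimal sketch using only the standard library; the `PINNED` dictionary and the `check_pins` helper are names introduced here for illustration, with the versions copied from the install command above.

```python
from importlib import metadata

# Pins copied from the install cell above (illustrative dictionary, not part of the notebook).
PINNED = {
    "langchain": "0.3.0",
    "langchain_openai": "0.2.0",
    "langchain_community": "0.3.0",
    "pymupdf": "1.24.10",
    "qdrant-client": "1.11.2",
    "langchain_qdrant": "0.1.4",
    "langsmith": "0.1.121",
}


def check_pins(pins):
    """Return {package: (expected, installed_or_None)} for every mismatch."""
    mismatches = {}
    for name, expected in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed at all
        if installed != expected:
            mismatches[name] = (expected, installed)
    return mismatches


# Print only the packages whose installed version differs from the pin.
for name, (expected, installed) in check_pins(PINNED).items():
    print(f"{name}: expected {expected}, found {installed}")
```

If nothing is printed, every pin matches. Any line that does appear is a good first place to look when an import fails later in the notebook.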


We'll need an OpenAI API Key:

In [1]:
import os
import getpass

os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API Key:")

And the LangSmith set-up:

In [3]:
import uuid

os.environ["LANGCHAIN_PROJECT"] = f"AIM Session 16 - {uuid.uuid4().hex[0:8]}"
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = getpass.getpass("LangChain API Key:")

Let's verify our project so we can leverage it in LangSmith later.

In [4]:
print(os.environ["LANGCHAIN_PROJECT"])

AIM Session 16 - de3c7821


## Task 2: Setting up RAG With Production in Mind

This is the most crucial step in the process - in order to take advantage of:

- Asynchronous requests
- Parallel Execution in Chains
- And more...

You must...use LCEL. These benefits are provided out of the box and largely optimized behind the scenes.

### Building our RAG Components: Retriever

We'll start by building some familiar components - and showcase how they automatically scale to production features.

Please upload a PDF file to use in this example!

⚠ If you're running in a non-Chrome browser, you may run into issues with this cell. In that case, please upload the file using Colab's file upload and set `file_path` to the uploaded file's path in the cell that defines `file_path`. ⚠

![image](https://i.imgur.com/Qa1Uwlj.png)



> NOTE: You can skip this step if you are running locally - please just point to your local file.

In [None]:
from google.colab import files
uploaded = files.upload()

Saving DeepSeek_R1 (1).pdf to DeepSeek_R1 (1) (1).pdf


In [5]:
# file_path = list(uploaded.keys())[0]
# file_path
file_path = "./DeepSeek_R1.pdf"
file_path

'./DeepSeek_R1.pdf'

We'll define our chunking strategy.

In [6]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
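To build some intuition for what `chunk_size` and `chunk_overlap` control, here is a deliberately simplified sketch. It is not how `RecursiveCharacterTextSplitter` works internally (the real splitter tries to break on separators such as paragraphs and lines before falling back to characters); the `chunk_text` helper is a hypothetical stand-in that only illustrates fixed-size windows that share some text with their neighbours.

```python
def chunk_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 100) -> list[str]:
    """Split text into windows of at most chunk_size characters.

    Consecutive windows share chunk_overlap characters.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


sample = "".join(str(i % 10) for i in range(2500))
chunks = chunk_text(sample)
print([len(c) for c in chunks])  # [1000, 1000, 700]
```

The last 100 characters of each chunk reappear at the start of the next one, which is what helps a retriever when a relevant sentence straddles a chunk boundary.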

We'll chunk our uploaded PDF file.

In [7]:
from langchain_community.document_loaders import PyMuPDFLoader

Loader = PyMuPDFLoader
loader = Loader(file_path)
documents = loader.load()
docs = text_splitter.split_documents(documents)
for i, doc in enumerate(docs):
    doc.metadata["source"] = f"source_{i}"

In [8]:
len(docs)

73

#### Qdrant Vector Database - Cache Backed Embeddings

The process of embedding is typically a very time-consuming one - for every single vector in our VDB, as well as for every query, we must:

1. Send the text to an API endpoint (self-hosted, OpenAI, etc)
2. Wait for processing
3. Receive response

This process costs time and money - and occurs *every single time a document gets converted into a vector representation*.

Instead, what if we:

1. Set up a cache that can hold our texts and their embeddings (similar to, or in some cases literally, a vector database)
2. Check the cache to see if we've already converted this text before.
  - If we have: Return the cached vector representation
  - Else: Send the text to an API endpoint (self-hosted, OpenAI, etc), then wait for processing and proceed
3. Store the text that was converted alongside its vector representation in the cache.
4. Return the vector representation

Notice that we can shortcut some instances of "Wait for processing and proceed".
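The check-then-store flow described above can be sketched with nothing but the standard library. This is an illustrative stand-in, not LangChain's `CacheBackedEmbeddings`: `ToyCachedEmbedder` and `fake_embed` are hypothetical names, a plain dictionary plays the role of the cache, and the "embedding" is a made-up deterministic function so the example runs offline.

```python
import hashlib


class ToyCachedEmbedder:
    """Illustrative stand-in for a cache-backed embedder (not LangChain's class)."""

    def __init__(self, embed_fn, namespace: str):
        self.embed_fn = embed_fn    # the expensive call (an API request in practice)
        self.namespace = namespace  # keeps vectors from different models apart
        self.store = {}             # key -> vector; a dict stands in for the cache
        self.api_calls = 0          # counts how often the expensive call is made

    def _key(self, text: str) -> str:
        # Exact-match key: the namespace plus a hash of the raw text.
        return self.namespace + hashlib.sha256(text.encode("utf-8")).hexdigest()

    def embed_documents(self, texts):
        vectors = []
        for text in texts:
            key = self._key(text)
            if key not in self.store:        # cache miss: pay for the call, then store
                self.api_calls += 1
                self.store[key] = self.embed_fn(text)
            vectors.append(self.store[key])  # cache hit, or the freshly stored vector
        return vectors


def fake_embed(text: str):
    """Hypothetical deterministic 'embedding' so the example needs no API key."""
    return [len(text), sum(map(ord, text)) % 97]


embedder = ToyCachedEmbedder(fake_embed, namespace="toy-model:")
first = embedder.embed_documents(["alpha", "beta"])
second = embedder.embed_documents(["alpha", "beta", "Alpha"])
print(embedder.api_calls)  # 3
```

Only three expensive calls are made for five requested embeddings. Note that `"Alpha"` misses the cache even though it differs from `"alpha"` only by case: the cache is keyed on the exact text.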

Let's see how this is implemented in the code.

In [9]:
from qdrant_client import QdrantClient
from qdrant_client.http.models import Distance, VectorParams
from langchain_openai.embeddings import OpenAIEmbeddings
from langchain.storage import LocalFileStore
from langchain_qdrant import QdrantVectorStore
from langchain.embeddings import CacheBackedEmbeddings

# Typical Embedding Model
core_embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

# Typical Qdrant Client Set-up
collection_name = f"pdf_to_parse_{uuid.uuid4()}"
client = QdrantClient(":memory:")
client.create_collection(
    collection_name=collection_name,
    # 1536 is the default output dimension of text-embedding-3-small
    vectors_config=VectorParams(size=1536, distance=Distance.COSINE),
)

# Adding cache!
store = LocalFileStore("./cache/")
cached_embedder = CacheBackedEmbeddings.from_bytes_store(
    core_embeddings, store, namespace=core_embeddings.model
)

# Typical Qdrant Vector Store Set-up
vectorstore = QdrantVectorStore(
    client=client,
    collection_name=collection_name,
    embedding=cached_embedder)
vectorstore.add_documents(docs)
retriever = vectorstore.as_retriever(search_type="mmr", search_kwargs={"k": 3})

##### ❓ Question #1:

What are some limitations you can see with this approach? When is this most/least useful? Discuss with your group!

> NOTE: There is no single correct answer here!

## Answer #1
The cache is tied to the local filesystem, so it is not easily shared across multiple machines or in a distributed environment, which limits scalability. It's also limited to exact matches: only text identical to something already embedded produces a cache hit.

Most useful for local development and testing, small-scale applications, and offline use cases. It also fits cases where the queries are specific and repeated verbatim, so an exact-match cache works better than one based on semantic similarity.

Least useful in production environments: distributed systems, high concurrency, shared data, or testing with larger embedding caches.

##### 🏗️ Activity #1:

Create a simple experiment that tests the cache-backed embeddings.

In [40]:
### YOUR CODE HERE
retriever.invoke("Is Distillation used to train the DeepSeek-R1 model?")

[Document(metadata={'producer': 'pdfTeX-1.40.26', 'creator': 'LaTeX with hyperref', 'creationdate': '2025-01-23T07:53:55+00:00', 'source': 'source_47', 'file_path': './DeepSeek_R1.pdf', 'total_pages': 22, 'format': 'PDF 1.5', 'title': '', 'author': '', 'subject': '', 'keywords': '', 'moddate': '2025-01-23T07:53:55+00:00', 'trapped': '', 'modDate': 'D:20250123075355Z', 'creationDate': 'D:20250123075355Z', 'page': 13, '_id': '28d42c7b4ff74acab52931a45ee4af2a', '_collection_name': 'pdf_to_parse_1ff1facc-3a0b-4159-9762-10b93cd60ffe'}, page_content='Preview on all evaluation metrics, while DeepSeek-R1-32B and DeepSeek-R1-70B significantly\nexceed o1-mini on most benchmarks. These results demonstrate the strong potential of distilla-\ntion. Additionally, we found that applying RL to these distilled models yields significant further\ngains. We believe this warrants further exploration and therefore present only the results of the\nsimple SFT-distilled models here.\n4. Discussion\n4.1. Distill

## Answer Activity #1
The retriever's latency fluctuates even with cache-backed embeddings. Sometimes it's slower and sometimes it's faster than the first time I ran the retriever on the same input text. I believe this may be because the OpenAI embedding model API is already fast and efficient. However, I do see that cache-backed embeddings would be beneficial with an open-source embedding model endpoint that I host for internal enterprise use.
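One way to make this kind of experiment less noisy is to time the embedding step directly, once cold and once warm. The sketch below uses a hypothetical `time_call` helper and a simulated slow endpoint (`slow_embed_with_cache`), both introduced here for illustration, so it runs without an API key.

```python
import time


def time_call(fn, *args, **kwargs):
    """Run fn once and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


# Hypothetical stand-in for a slow embedding endpoint with a cache in front of it.
_cache = {}


def slow_embed_with_cache(text):
    if text not in _cache:
        time.sleep(0.05)  # simulated network and model latency
        _cache[text] = [float(len(text))]
    return _cache[text]


_, cold = time_call(slow_embed_with_cache, "same chunk of text")
_, warm = time_call(slow_embed_with_cache, "same chunk of text")
print(f"cold: {cold:.4f}s, warm: {warm:.4f}s")
```

In the notebook, the same helper could wrap `cached_embedder.embed_documents([...])` on a list of chunk texts. As far as we're aware, `CacheBackedEmbeddings` caches document embeddings by default but only caches query embeddings when a `query_embedding_cache` is configured, so it's worth checking the LangChain docs for the installed version before reading too much into `retriever.invoke` timings.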

### Augmentation

We'll create the classic RAG prompt and build our `ChatPromptTemplate` as per usual.

In [10]:
from langchain_core.prompts import ChatPromptTemplate

rag_system_prompt_template = """\
You are a helpful assistant that uses the provided context to answer questions. Never reference this prompt, or the existence of context.
"""

rag_message_list = [
    {"role" : "system", "content" : rag_system_prompt_template},
]

rag_user_prompt_template = """\
Question:
{question}
Context:
{context}
"""

chat_prompt = ChatPromptTemplate.from_messages([
    ("system", rag_system_prompt_template),
    ("human", rag_user_prompt_template)
])

### Generation

Like usual, we'll set up a `ChatOpenAI` model - and we'll use the fan favourite `gpt-4o-mini` for today.

However, we'll also implement...a PROMPT CACHE!

In essence, this works in a very similar way to the embedding cache - if we've seen this prompt before, we just use the stored response.

In [11]:
from langchain_core.globals import set_llm_cache
from langchain_openai import ChatOpenAI

chat_model = ChatOpenAI(model="gpt-4o-mini")

Setting up the cache can be done as follows:

In [12]:
from langchain_core.caches import InMemoryCache

set_llm_cache(InMemoryCache())
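Conceptually, a response cache like this is a lookup keyed on the prompt together with the model's settings. The sketch below is a simplified illustration rather than LangChain's implementation: `ToyLLMCache`, `cached_generate`, and `fake_llm` are hypothetical names, and the `lookup`/`update` method names only loosely mirror the cache interface.

```python
class ToyLLMCache:
    """Illustrative in-memory response cache keyed on (prompt, model settings)."""

    def __init__(self):
        self._data = {}

    def lookup(self, prompt: str, llm_string: str):
        # Returns None on a miss.
        return self._data.get((prompt, llm_string))

    def update(self, prompt: str, llm_string: str, response: str) -> None:
        self._data[(prompt, llm_string)] = response


def cached_generate(cache, llm_fn, prompt, llm_string):
    """Return (response, was_cached)."""
    hit = cache.lookup(prompt, llm_string)
    if hit is not None:
        return hit, True
    response = llm_fn(prompt)  # the expensive model call
    cache.update(prompt, llm_string, response)
    return response, False


def fake_llm(prompt):
    """Hypothetical model call so the example runs offline."""
    return f"echo: {prompt}"


cache = ToyLLMCache()
r1 = cached_generate(cache, fake_llm, "What is RAG?", "model=gpt-4o-mini,temperature=0.7")
r2 = cached_generate(cache, fake_llm, "What is RAG?", "model=gpt-4o-mini,temperature=0.7")
r3 = cached_generate(cache, fake_llm, "What is RAG?", "model=gpt-4o-mini,temperature=0")
print(r1[1], r2[1], r3[1])  # False True False
```

The third call misses because the model settings differ, even though the prompt is identical. Any change to the prompt text, however small, misses in the same way.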

##### ❓ Question #2:

What are some limitations you can see with this approach? When is this most/least useful? Discuss with your group!

> NOTE: There is no single correct answer here!

## Answer #2
It stores LLM responses in the computer's RAM. The limitations are:

*   The cache is lost when the Python process terminates, so it cannot be persisted.
*   The cache is local to the current Python process and is only accessible from the process where it was created. With multiple processes, each one has its own isolated cache.
*   It is limited by the computer's RAM size.
*   The cache cannot be shared.

Most useful when testing locally to avoid unnecessary calls to the LLM API.
Least useful in production, due to the lack of persistence and scalability, and for larger-scale testing, distributed-process testing, cache persistence testing, or concurrency testing.

##### 🏗️ Activity #2:

Create a simple experiment that tests the LLM cache.

In [14]:
### YOUR CODE HERE
chat_model.invoke("What method was used to train DeepSeek-R1 model?")

AIMessage(content='As of my last update in October 2023, there isn\'t a widely recognized model specifically named "DeepSeek-R1" in available literature or major AI resources. If this is a recent model or specific to a niche application, I recommend checking the latest research papers or resources from related domains for detailed methodologies used in its training. If you have more context or details about the model\'s application or field, I may be able to provide more targeted information!', additional_kwargs={'refusal': None}, response_metadata={'token_usage': {'completion_tokens': 92, 'prompt_tokens': 19, 'total_tokens': 111, 'completion_tokens_details': {'accepted_prediction_tokens': 0, 'audio_tokens': 0, 'reasoning_tokens': 0, 'rejected_prediction_tokens': 0}, 'prompt_tokens_details': {'audio_tokens': 0, 'cached_tokens': 0}}, 'model_name': 'gpt-4o-mini-2024-07-18', 'system_fingerprint': 'fp_06737a9306', 'finish_reason': 'stop', 'logprobs': None}, id='run-88c7b884-b2aa-491e-9186-

## Answer Activity #2
With the LLM response cached, latency dropped to 0.0 seconds when the same question was asked a second time!

## Task 3: RAG LCEL Chain

We'll also set up our typical RAG chain using LCEL.

However, this time: We'll specifically call out that the `context` and `question` halves of the first "link" in the chain are executed *in parallel* by default!

Thanks, LCEL!
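To make the "in parallel" claim concrete, here is a minimal standard-library sketch of the idea. In LCEL, a dictionary at the start of a chain is coerced into a `RunnableParallel`; the code below is not that implementation, only a hypothetical `run_parallel` helper with made-up branch functions that shows two branches receiving the same input and running concurrently.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_parallel(steps: dict, inputs):
    """Run every callable in steps on the same inputs concurrently.

    Returns {key: result}, mirroring the shape of the steps dictionary.
    """
    with ThreadPoolExecutor(max_workers=len(steps)) as pool:
        futures = {key: pool.submit(fn, inputs) for key, fn in steps.items()}
        return {key: future.result() for key, future in futures.items()}


def fake_retriever(inputs):
    """Hypothetical slow branch standing in for retrieval."""
    time.sleep(0.3)
    return [f"chunk about {inputs['question']}"]


def passthrough_question(inputs):
    """Hypothetical branch, slowed down only to make the overlap visible."""
    time.sleep(0.3)
    return inputs["question"]


start = time.perf_counter()
out = run_parallel(
    {"context": fake_retriever, "question": passthrough_question},
    {"question": "caching"},
)
elapsed = time.perf_counter() - start
print(out, f"{elapsed:.2f}s")
```

Run one after the other, the two branches would take about 0.6 seconds; run concurrently, the total stays close to the slower branch alone.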

In [44]:
from operator import itemgetter
from langchain_core.runnables.passthrough import RunnablePassthrough

retrieval_augmented_qa_chain = (
        {"context": itemgetter("question") | retriever, "question": itemgetter("question")}
        | RunnablePassthrough.assign(context=itemgetter("context"))
        | chat_prompt | chat_model
    )

Let's test it out!

In [45]:
retrieval_augmented_qa_chain.invoke({"question" : "Write 50 things about this document!"})

AIMessage(content='1. The document is titled "DeepSeek_R1."\n2. It is produced using pdfTeX version 1.40.26.\n3. The creator of the document is LaTeX with hyperref.\n4. The document was created on January 23, 2025.\n5. Its source is denoted as "source_41."\n6. The document is in PDF format version 1.5.\n7. The total number of pages in the document is 22.\n8. The document includes metadata related to document creation and modification.\n9. The document consists of a section titled "DeepSeek-R1 Evaluation."\n10. It presents benchmarking metrics for various models such as Claude-3.5, GPT-4o, and DeepSeek.\n11. The evaluation metrics include MMLU, DROP, IF-Eval, GPQA Diamond, and others.\n12. Activated parameters for DeepSeek are reported as 37 billion.\n13. The total parameters for DeepSeek are reported as 671 billion.\n14. MMLU (Pass@1) scores are provided for multiple models.\n15. Benchmarks include performance on mathematical and coding tasks.\n16. Several models outperform each other 

In [46]:
retrieval_augmented_qa_chain.invoke({"question" : "Write 50 things about this document!"})

AIMessage(content='1. The document is titled "DeepSeek_R1."\n2. It is produced using pdfTeX version 1.40.26.\n3. The creator of the document is LaTeX with hyperref.\n4. The document was created on January 23, 2025.\n5. Its source is denoted as "source_41."\n6. The document is in PDF format version 1.5.\n7. The total number of pages in the document is 22.\n8. The document includes metadata related to document creation and modification.\n9. The document consists of a section titled "DeepSeek-R1 Evaluation."\n10. It presents benchmarking metrics for various models such as Claude-3.5, GPT-4o, and DeepSeek.\n11. The evaluation metrics include MMLU, DROP, IF-Eval, GPQA Diamond, and others.\n12. Activated parameters for DeepSeek are reported as 37 billion.\n13. The total parameters for DeepSeek are reported as 671 billion.\n14. MMLU (Pass@1) scores are provided for multiple models.\n15. Benchmarks include performance on mathematical and coding tasks.\n16. Several models outperform each other 

##### 🏗️ Activity #3:

Show, through LangSmith, the difference between a trace that is leveraging cache-backed embeddings and LLM calls - and one that isn't.

Post screenshots in the notebook!

## Answer Activity #3
On the initial call of the chain with the given query, the retriever took slightly longer to find relevant context, and generating the response took around 18 seconds. On the second attempt, the retriever took less time to find relevant context and the response was returned in 0 seconds. This is because the [query, response] pair was cached.

Images: LangSmith_overall.png, LangSmith_first_attempt.png, and LangSmith_second_attempt.png have been added to the session 16 folder.