# Prototyping a LangChain Application with Production-Minded Changes

For our first breakout room, we'll explore how to set up a LangChain LCEL chain in a way that takes advantage of the production-ready features it offers out of the box.

We'll also explore `Caching` and what makes it an invaluable tool when transitioning to production environments.


## Task 1: Dependencies and Set-Up

Let's get everything we need - we're going to pin exact versions today to reduce the risk of environment issues!

> NOTE: Dependency issues are a large portion of what you're going to be tackling as you integrate new technology into your work - please keep in mind that one of the things you should be passively learning throughout this course is ways to mitigate dependency issues.
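One lightweight way to catch version drift is to compare what is actually installed against the pins. The sketch below is an illustration only: the helper name `installed_versions` is hypothetical, and the `expected` dict copies just a few of the pins from the install cell that follows. It is meant to be run after that install cell.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each package name to its installed version, or None if it is missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# A few of the pins from the install cell (hyphenated distribution names).
expected = {
    "langchain": "0.3.0",
    "langchain-openai": "0.2.0",
    "qdrant-client": "1.11.2",
}

found = installed_versions(expected)
for name, pinned in expected.items():
    status = "OK" if found[name] == pinned else "MISMATCH"
    print(f"{name}: pinned {pinned}, installed {found[name]} -> {status}")
```

A `MISMATCH` line is a hint to restart the runtime or re-run the install cell before debugging anything else.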

In [1]:
!pip install -qU langchain_openai==0.2.0 langchain_community==0.3.0 langchain==0.3.0 pymupdf==1.24.10 qdrant-client==1.11.2 langchain_qdrant==0.1.4 langsmith==0.1.121


We'll need an OpenAI API Key:

In [2]:
import os
import getpass

os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API Key:")

OpenAI API Key:··········


And the LangSmith set-up:

In [3]:
import uuid

os.environ["LANGCHAIN_PROJECT"] = f"AIM Week 8 Assignment 1 - {uuid.uuid4().hex[0:8]}"
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_ENDPOINT"] = "https://api.smith.langchain.com"
os.environ["LANGCHAIN_API_KEY"] = getpass.getpass("LangChain API Key:")

LangChain API Key:··········


Let's verify our project so we can leverage it in LangSmith later.

In [4]:
print(os.environ["LANGCHAIN_PROJECT"])

AIM Week 8 Assignment 1 - f9738ac1


## Task 2: Setting up RAG With Production in Mind

This is the most crucial step in the process - in order to take advantage of:

- Asynchronous requests
- Parallel Execution in Chains
- And more...

You must...use LCEL. These benefits are provided out of the box and largely optimized behind the scenes.
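To make the async benefit concrete: LCEL runnables expose async counterparts such as `ainvoke` and `abatch`. The sketch below does not call LangChain at all. It uses plain `asyncio` and a hypothetical `fake_llm_call` stand-in to show why overlapping network-bound calls helps: three simulated 0.1-second calls finish in roughly the time of one.

```python
import asyncio
import time


async def fake_llm_call(question: str, delay: float = 0.1) -> str:
    """Hypothetical stand-in for a network-bound model call."""
    await asyncio.sleep(delay)
    return f"answer to: {question}"


async def answer_all(questions):
    # Start every call at once and wait for all of them;
    # gather returns results in the same order as the inputs.
    return await asyncio.gather(*(fake_llm_call(q) for q in questions))


questions = ["Who is Victor?", "Who is Elizabeth?", "Who is Walton?"]

start = time.perf_counter()
answers = asyncio.run(answer_all(questions))
elapsed = time.perf_counter() - start

print(answers)
print(f"{len(questions)} calls finished in {elapsed:.2f}s")
```

In a notebook, where an event loop is already running, `await answer_all(questions)` takes the place of `asyncio.run(...)`.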

### Building our RAG Components: Retriever

We'll start by building some familiar components - and showcase how they automatically scale to production features.

Please upload a PDF file to use in this example!

In [5]:
from google.colab import files
uploaded = files.upload()

Saving Frankenstein.pdf to Frankenstein.pdf


In [6]:
file_path = list(uploaded.keys())[0]
file_path

'Frankenstein.pdf'

We'll define our chunking strategy.

In [7]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
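To build intuition for `chunk_size` and `chunk_overlap`, here is a deliberately simplified sliding-window chunker. It is not how `RecursiveCharacterTextSplitter` works internally (the real splitter prefers to break on separators such as paragraph and line breaks); the function name `chunk_text` is hypothetical and only the size/overlap arithmetic is illustrated.

```python
def chunk_text(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-size windows where neighbours share `chunk_overlap` characters."""
    if chunk_size <= 0 or not 0 <= chunk_overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= chunk_overlap < chunk_size")

    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


print(chunk_text("abcdefghij", chunk_size=4, chunk_overlap=1))
# ['abcd', 'defg', 'ghij']
```

Each chunk repeats the last character of the previous one, which is the same role the 100-character overlap plays for the 1000-character chunks configured above.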

We'll chunk our uploaded PDF file.

In [8]:
from langchain_community.document_loaders import PyMuPDFLoader

Loader = PyMuPDFLoader
loader = Loader(file_path)
documents = loader.load()
docs = text_splitter.split_documents(documents)
for i, doc in enumerate(docs):
    doc.metadata["source"] = f"source_{i}"

#### QDrant Vector Database - Cache Backed Embeddings

The process of embedding is typically a very time-consuming one - we must, for every single vector in our VDB as well as every query:

1. Send the text to an API endpoint (self-hosted, OpenAI, etc)
2. Wait for processing
3. Receive response

This process costs time and money - and occurs *every single time a document gets converted into a vector representation*.

Instead, what if we:

1. Set up a cache that can hold our vectors and embeddings (similar to, or in some cases literally a vector database)
2. Check the cache to see if we've already converted this text before.
  - If we have: Return the cached vector representation
  - Else: Proceed to the next step
3. Send the text to an API endpoint (self-hosted, OpenAI, etc) and wait for processing
4. Store the text that was converted alongside its vector representation in a cache of some kind.
5. Return the vector representation

Notice that every cache hit lets us skip the API call, and the wait that comes with it, entirely.
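The cache-first flow can be sketched with nothing but the standard library. Everything here is an assumption made for illustration: `CountingEmbedder` is a hypothetical stand-in for a paid embedding API, `CachedEmbedder` is a toy wrapper, and a plain dict plays the role of the store. It is not LangChain's implementation.

```python
import hashlib


class CountingEmbedder:
    """Hypothetical stand-in for an embedding API; counts how many texts it embeds."""

    def __init__(self):
        self.calls = 0

    def embed_documents(self, texts):
        self.calls += len(texts)
        # Toy two-dimensional "embedding" so the example runs offline.
        return [[float(len(t)), float(sum(map(ord, t)) % 97)] for t in texts]


class CachedEmbedder:
    """Check the cache first and only send unseen texts to the underlying embedder."""

    def __init__(self, underlying):
        self.underlying = underlying
        self.store = {}  # hash of text -> vector

    def embed_documents(self, texts):
        keys = [hashlib.sha256(t.encode("utf-8")).hexdigest() for t in texts]
        # Collect the texts we have not embedded before (deduplicated by key).
        missing = {k: t for k, t in zip(keys, texts) if k not in self.store}
        if missing:
            vectors = self.underlying.embed_documents(list(missing.values()))
            for key, vector in zip(missing, vectors):
                self.store[key] = vector
        # Every text is now in the cache, so answer from it.
        return [self.store[k] for k in keys]


api = CountingEmbedder()
cached = CachedEmbedder(api)

first = cached.embed_documents(["chunk one", "chunk two"])
second = cached.embed_documents(["chunk one", "chunk two", "chunk three"])
print(api.calls)  # 3: only "chunk three" reached the API on the second call
```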

Let's see how this is implemented in the code.

In [9]:
from qdrant_client import QdrantClient
from qdrant_client.http.models import Distance, VectorParams
from langchain_openai.embeddings import OpenAIEmbeddings
from langchain.storage import LocalFileStore
from langchain_qdrant import QdrantVectorStore
from langchain.embeddings import CacheBackedEmbeddings

# Typical Embedding Model
core_embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

# Typical QDrant Client Set-up
collection_name = f"pdf_to_parse_{uuid.uuid4()}"
client = QdrantClient(":memory:")
client.create_collection(
    collection_name=collection_name,
    vectors_config=VectorParams(size=1536, distance=Distance.COSINE),
)

# Adding cache!
store = LocalFileStore("./cache/")
cached_embedder = CacheBackedEmbeddings.from_bytes_store(
    core_embeddings, store, namespace=core_embeddings.model
)

# Typical QDrant Vector Store Set-up
vectorstore = QdrantVectorStore(
    client=client,
    collection_name=collection_name,
    embedding=cached_embedder)
vectorstore.add_documents(docs)
retriever = vectorstore.as_retriever(search_type="mmr", search_kwargs={"k": 3})

##### ❓ Question #1:

What are some limitations you can see with this approach? When is this most/least useful? Discuss with your group!

> NOTE: There is no single correct answer here!

1. If the underlying embedding model changes, or you need to rebuild the index, your cache is effectively invalidated and will need to be regenerated.
2. If you're not re-using results, caching doesn't provide any benefit (it's actually more costly if each text is only ever embedded once, since you have to spend resources to cache it in the first place).

This is most useful when you're expecting to get the same queries or use the same pieces of context over and over. It's also good for fairly static datasets. That means the cache won't really change much after being built.

It's least useful for one-time use, for reasons explained above.
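The invalidation point can be illustrated with a namespaced cache key, similar in spirit to passing `namespace=core_embeddings.model` above. The key format and the `cache_key` helper below are hypothetical and differ from LangChain's actual hashing; the placeholder vector is not a real embedding.

```python
import hashlib


def cache_key(namespace: str, text: str) -> str:
    """Build a key from the model name plus a hash of the text (illustrative format)."""
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    return f"{namespace}:{digest}"


store = {}
text = "It was on a dreary night of November"

# Cache a placeholder vector under the model used in this notebook.
store[cache_key("text-embedding-3-small", text)] = [0.1, 0.2]

same_model_hit = cache_key("text-embedding-3-small", text) in store
other_model_hit = cache_key("text-embedding-3-large", text) in store

print(same_model_hit)   # True: same model, same text
print(other_model_hit)  # False: a different model never sees the old entries
```

Namespacing prevents vectors from different models being mixed up, but it also means that switching models starts from an empty cache.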

##### 🏗️ Activity #1:

Create a simple experiment that tests the cache-backed embeddings.

In [12]:
### YOUR CODE HERE
import time

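# Time a single retrieval call (query embedding + vector search).
# Caveat: by default CacheBackedEmbeddings caches document embeddings only;
# query embeddings are cached only when a query_embedding_cache is configured,
# so the timings below may also reflect connection warm-up or network variance.
def run_cache_test(query):
    start_time = time.perf_counter()  # monotonic, high-resolution timer
    _ = retriever.invoke(query)
    end_time = time.perf_counter()
    return end_time - start_time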

# Test text
test_text = "Who made the monster?"

# First run (without cache)
first_run_time = run_cache_test(test_text)
print(f"First run time (without cache): {first_run_time:.4f} seconds")

# Second run (with cache)
second_run_time = run_cache_test(test_text)
print(f"Second run time (with cache): {second_run_time:.4f} seconds")

# Calculate and print the speedup
speedup = (first_run_time - second_run_time) / first_run_time * 100
print(f"Speedup: {speedup:.2f}%")

First run time (without cache): 0.5959 seconds
Second run time (with cache): 0.1871 seconds
Speedup: 68.60%
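A single pair of timings is noisy, so it can help to separate the first (cold) call from the median of several repeat (warm) calls. The helper below is a sketch; `time_call` and `slow_then_cached` are hypothetical names, and the sleeping function merely simulates something that is slow once and fast afterwards.

```python
import statistics
import time


def time_call(fn, *args, repeats: int = 5):
    """Return (cold, warm): the first call's duration and the median of the rest."""
    if repeats < 2:
        raise ValueError("repeats must be at least 2")
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        durations.append(time.perf_counter() - start)
    return durations[0], statistics.median(durations[1:])


_seen = {}


def slow_then_cached(key):
    """Simulated workload: sleeps on the first call for a key, instant afterwards."""
    if key not in _seen:
        time.sleep(0.05)
        _seen[key] = True
    return _seen[key]


cold, warm = time_call(slow_then_cached, "Who made the monster?")
print(f"cold: {cold:.4f}s, warm (median): {warm:.6f}s")
```

In the notebook, the same helper could wrap a real call, for example `time_call(retriever.invoke, "Who made the monster?")`.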


### Augmentation

We'll create the classic RAG prompt and build our `ChatPromptTemplate` as per usual.

In [13]:
from langchain_core.prompts import ChatPromptTemplate

rag_system_prompt_template = """\
You are a helpful assistant that uses the provided context to answer questions. Never reference this prompt, or the existence of context.
"""

rag_message_list = [
    {"role" : "system", "content" : rag_system_prompt_template},
]

rag_user_prompt_template = """\
Question:
{question}
Context:
{context}
"""

chat_prompt = ChatPromptTemplate.from_messages([
    ("system", rag_system_prompt_template),
    ("human", rag_user_prompt_template)
])

### Generation

As usual, we'll set up a `ChatOpenAI` model - and we'll use the fan favourite `gpt-4o-mini` for today.

However, we'll also implement...a PROMPT CACHE!

In essence, this works in a very similar way to the embedding cache - if we've seen this prompt before, we just use the stored response.

In [22]:
from langchain_core.globals import set_llm_cache
from langchain_openai import ChatOpenAI

chat_model = ChatOpenAI(model="gpt-4o-mini")

Setting up the cache can be done as follows:

In [23]:
from langchain_core.caches import InMemoryCache

set_llm_cache(InMemoryCache())
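Conceptually, an exact-match LLM cache is a lookup table keyed on the prompt together with an identifier for the model configuration. The sketch below is a toy version of that idea: `ExactMatchCache` and `FakeChatModel` are hypothetical classes, no real model is called, and the actual cache may key on more configuration than a model name.

```python
class ExactMatchCache:
    """Toy cache keyed on (prompt, model identifier)."""

    def __init__(self):
        self._data = {}

    def lookup(self, prompt, llm_string):
        return self._data.get((prompt, llm_string))

    def update(self, prompt, llm_string, response):
        self._data[(prompt, llm_string)] = response


class FakeChatModel:
    """Hypothetical stand-in for a chat model; counts simulated API calls."""

    def __init__(self, name, cache):
        self.name = name
        self.cache = cache
        self.api_calls = 0

    def invoke(self, prompt):
        cached = self.cache.lookup(prompt, self.name)
        if cached is not None:
            return cached  # cache hit: no API call
        self.api_calls += 1
        response = f"[{self.name}] response to: {prompt}"
        self.cache.update(prompt, self.name, response)
        return response


model = FakeChatModel("gpt-4o-mini", ExactMatchCache())
model.invoke("What is Frankenstein's monster?")
model.invoke("What is Frankenstein's monster?")  # identical prompt: served from cache
model.invoke("Who is Frankenstein's monster?")   # one word differs: cache miss
print(model.api_calls)  # 2
```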

##### ❓ Question #2:

What are some limitations you can see with this approach? When is this most/least useful? Discuss with your group!

> NOTE: There is no single correct answer here!

1. The cache matches on the exact prompt (and model settings), so even a minor change in wording or context results in a cache miss, and semantically similar queries get no benefit.
2. If the information behind a prompt becomes outdated, the cached response will continue to serve the old answer.

This is useful for times when the same exact prompts are frequently used, though. It's less useful for scenarios where each interaction should be tailored to the user's current context.
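One common way to limit stale answers is to let cache entries expire. The sketch below is a hypothetical `TTLCache` wrapper, not a feature of the cache configured above; it takes an injectable clock so the expiry logic can be demonstrated without waiting.

```python
import time


class TTLCache:
    """Toy cache whose entries expire after `ttl_seconds`."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self._clock = clock
        self._data = {}  # key -> (value, time stored)

    def set(self, key, value):
        self._data[key] = (value, self._clock())

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        value, stored_at = entry
        if self._clock() - stored_at > self.ttl:
            del self._data[key]  # too old: drop it and report a miss
            return None
        return value


# A fake clock we can move forward by hand.
now = [0.0]
cache = TTLCache(ttl_seconds=60, clock=lambda: now[0])

cache.set("What is Frankenstein's monster?", "cached answer")
now[0] = 30.0
fresh = cache.get("What is Frankenstein's monster?")
now[0] = 120.0
stale = cache.get("What is Frankenstein's monster?")

print(fresh)  # cached answer
print(stale)  # None
```

The trade-off is the usual one: a shorter TTL means fresher answers but fewer cache hits.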


##### 🏗️ Activity #2:

Create a simple experiment that tests the LLM prompt cache.

In [24]:
def run_llm_test(prompt):
    start_time = time.time()
    response = chat_model.invoke(prompt)
    end_time = time.time()
    return end_time - start_time, response

# Test prompt
test_prompt = "What is Frankenstein's monster?"

# First run (without cache)
first_run_time, first_response = run_llm_test(test_prompt)
print(f"First run time (without cache): {first_run_time:.4f} seconds")
print(f"First response: {first_response.content[:50]}...")  # Print first 50 characters

# Second run (with cache)
second_run_time, second_response = run_llm_test(test_prompt)
print(f"\nSecond run time (with cache): {second_run_time:.4f} seconds")
print(f"Second response: {second_response.content[:50]}...")  # Print first 50 characters

# Calculate and print the speedup
speedup = (first_run_time - second_run_time) / first_run_time * 100
print(f"\nSpeedup: {speedup:.2f}%")

# Verify cache is working
print(f"\nResponses identical: {first_response.content == second_response.content}")

# Test with a slightly different prompt
slightly_different_prompt = "Who is Frankenstein's monster?"
third_run_time, third_response = run_llm_test(slightly_different_prompt)
print(f"\nThird run time (different prompt): {third_run_time:.4f} seconds")
print(f"Third response: {third_response.content[:50]}...")

First run time (without cache): 3.1790 seconds
First response: Frankenstein's monster is a fictional character th...

Second run time (with cache): 0.0021 seconds
Second response: Frankenstein's monster is a fictional character th...

Speedup: 99.93%

Responses identical: True

Third run time (different prompt): 4.4612 seconds
Third response: Frankenstein's monster is a fictional character fr...


## Task 3: RAG LCEL Chain

We'll also set up our typical RAG chain using LCEL.

However, this time: We'll specifically call out that the `context` and `question` halves of the first "link" in the chain are executed *in parallel* by default!

Thanks, LCEL!

In [25]:
from operator import itemgetter
from langchain_core.runnables.passthrough import RunnablePassthrough

retrieval_augmented_qa_chain = (
        {"context": itemgetter("question") | retriever, "question": itemgetter("question")}
        | RunnablePassthrough.assign(context=itemgetter("context"))
        | chat_prompt | chat_model
    )
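In LCEL, a dict at the start of a chain is coerced into a `RunnableParallel`, which runs each key's branch on the same input. The sketch below imitates that fan-out with a thread pool from the standard library; `run_parallel` and `fake_retriever` are hypothetical names, and the thread pool is a simplification rather than a description of LCEL's internals.

```python
from concurrent.futures import ThreadPoolExecutor
import time


def fake_retriever(question):
    """Hypothetical stand-in for a retriever with some network latency."""
    time.sleep(0.1)
    return [f"passage relevant to '{question}'"]


def run_parallel(branches, inputs):
    """Run every branch on the same input concurrently and collect results by key."""
    with ThreadPoolExecutor(max_workers=len(branches)) as pool:
        futures = {key: pool.submit(fn, inputs) for key, fn in branches.items()}
        return {key: future.result() for key, future in futures.items()}


branches = {
    "context": lambda x: fake_retriever(x["question"]),
    "question": lambda x: x["question"],
}

result = run_parallel(branches, {"question": "Who made the monster?"})
print(result)
```

The output dict has the same keys as `branches`, which is the shape the prompt template expects for `{context}` and `{question}`.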

Let's test it out!

In [26]:
retrieval_augmented_qa_chain.invoke({"question" : "Write 50 things about this document!"})

AIMessage(content='1. The document is a PDF version of "Frankenstein" by Mary Wollstonecraft Shelley.\n2. It contains a total of 277 pages.\n3. The document was created using Adobe InDesign CS2.\n4. The PDF format is version 1.5.\n5. The author, Mary Wollstonecraft Shelley, is known for her contributions to Gothic literature.\n6. The document was produced using Adobe PDF Library 7.0.\n7. The creation date of the document is February 6, 2008.\n8. The last modification date is July 6, 2008.\n9. This document is available for free download from Planet eBook.\n10. The text includes themes of creation and the consequences of scientific exploration.\n11. The story features a complex relationship between creator and creation.\n12. The protagonist, Victor Frankenstein, is a scientist who creates a living being.\n13. The creature, often referred to as Frankenstein\'s monster, grapples with issues of identity and belonging.\n14. The document explores themes of isolation and loneliness.\n15. It h

In [27]:
retrieval_augmented_qa_chain.invoke({"question" : "Write 50 things about this document!"})

AIMessage(content='1. The document is a PDF version of "Frankenstein" by Mary Wollstonecraft Shelley.\n2. It contains a total of 277 pages.\n3. The document was created using Adobe InDesign CS2.\n4. The PDF format is version 1.5.\n5. The author, Mary Wollstonecraft Shelley, is known for her contributions to Gothic literature.\n6. The document was produced using Adobe PDF Library 7.0.\n7. The creation date of the document is February 6, 2008.\n8. The last modification date is July 6, 2008.\n9. This document is available for free download from Planet eBook.\n10. The text includes themes of creation and the consequences of scientific exploration.\n11. The story features a complex relationship between creator and creation.\n12. The protagonist, Victor Frankenstein, is a scientist who creates a living being.\n13. The creature, often referred to as Frankenstein\'s monster, grapples with issues of identity and belonging.\n14. The document explores themes of isolation and loneliness.\n15. It h

##### 🏗️ Activity #3:

Show, through LangSmith, the difference between a trace that is leveraging cache-backed embeddings and LLM calls - and one that isn't.

Post screenshots in the notebook!

# Uncached
![image](imgs/uncached.png)

# Cached
![image](imgs/cached.png)