# Advanced Retrieval with LangChain

In the following notebook, we'll explore various methods of advanced retrieval using LangChain!

We'll touch on:

- Naive Retrieval
- Best-Matching 25 (BM25)
- Multi-Query Retrieval
- Parent-Document Retrieval
- Contextual Compression (a.k.a. Rerank)
- Ensemble Retrieval
- Semantic chunking

We'll also discuss how these methods impact performance on our set of documents with a simple RAG chain.

There will be two breakout rooms:

- 🤝 Breakout Room Part #1
  - Task 1: Getting Dependencies!
  - Task 2: Data Collection and Preparation
  - Task 3: Setting Up QDrant!
  - Task 4-10: Retrieval Strategies
- 🤝 Breakout Room Part #2
  - Activity: Evaluate with Ragas

# 🤝 Breakout Room Part #1

## Task 1: Getting Dependencies!

We're going to need a few specific LangChain community packages, like OpenAI (for our [LLM](https://platform.openai.com/docs/models) and [Embedding Model](https://platform.openai.com/docs/guides/embeddings)) and Cohere (for our [Reranker](https://cohere.com/rerank)).

We'll also provide our OpenAI key, as well as our Cohere API key.

In [2]:
import os
import dotenv

dotenv.load_dotenv()

True

In [2]:
# import os
# import getpass
# 
# os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter your OpenAI API Key:")

In [3]:
# os.environ["COHERE_API_KEY"] = getpass.getpass("Cohere API Key:")
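Before moving on, it can help to confirm that both keys are actually visible to the notebook. The sketch below is only an illustration: `missing_keys` and `REQUIRED_KEYS` are hypothetical names, not part of LangChain or python-dotenv, and the check assumes the keys live in the standard `OPENAI_API_KEY` and `COHERE_API_KEY` environment variables used above.

```python
import os

# Environment variable names used earlier in this notebook
REQUIRED_KEYS = ("OPENAI_API_KEY", "COHERE_API_KEY")


def missing_keys(env=None, required=REQUIRED_KEYS):
    """Return the names of required keys that are unset or empty (hypothetical helper)."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


absent = missing_keys()
if absent:
    print("Missing API keys:", ", ".join(absent))
else:
    print("All API keys are set.")
```

If anything is reported missing, either add it to the `.env` file or uncomment the `getpass` cells above.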

## Task 2: Data Collection and Preparation

We'll be using our Loan Data once again - this time the structured data available through the CSV!

### Data Preparation

We want to make sure all our documents have the relevant metadata for the various retrieval strategies we're going to be applying today.

In [3]:
from langchain_community.document_loaders.csv_loader import CSVLoader
from datetime import datetime, timedelta

loader = CSVLoader(
    file_path="./data/complaints.csv",
    metadata_columns=[
      "Date received", 
      "Product", 
      "Sub-product", 
      "Issue", 
      "Sub-issue", 
      "Consumer complaint narrative", 
      "Company public response", 
      "Company", 
      "State", 
      "ZIP code", 
      "Tags", 
      "Consumer consent provided?", 
      "Submitted via", 
      "Date sent to company", 
      "Company response to consumer", 
      "Timely response?", 
      "Consumer disputed?", 
      "Complaint ID"
    ]
)

loan_complaint_data = loader.load()

for doc in loan_complaint_data:
    doc.page_content = doc.metadata["Consumer complaint narrative"]
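To make the shape of the prepared documents concrete, here is a minimal stand-alone sketch of the same idea using only the standard library: every CSV column is kept as metadata, and one chosen column becomes the text that will later be embedded. `SimpleDoc` and `rows_to_docs` are hypothetical stand-ins rather than LangChain classes, and the two sample rows are made up for illustration.

```python
import csv
import io
from dataclasses import dataclass, field


@dataclass
class SimpleDoc:
    """Hypothetical stand-in for a LangChain Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def rows_to_docs(csv_text, content_column):
    """Turn each CSV row into a document whose text is one chosen column."""
    reader = csv.DictReader(io.StringIO(csv_text))
    docs = []
    for row_number, row in enumerate(reader):
        metadata = dict(row)          # keep every column as metadata
        metadata["row"] = row_number  # remember where the row came from
        docs.append(SimpleDoc(page_content=row[content_column], metadata=metadata))
    return docs


# Made-up toy rows, not taken from complaints.csv
sample = (
    "Company,Timely response?,Consumer complaint narrative\n"
    "ExampleCo,Yes,My payment was applied to the wrong loan.\n"
    "OtherCo,No,I never received a reply.\n"
)

docs = rows_to_docs(sample, "Consumer complaint narrative")
print(docs[0].page_content)
print(docs[1].metadata["Timely response?"])
```

Keeping fields such as "Timely response?" in the metadata matters later: the retriever only matches against the narrative text, but the LLM still sees the metadata when the documents are placed in the prompt.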

Let's look at an example document to see if everything worked as expected!

In [5]:
loan_complaint_data[0]

Document(metadata={'source': './data/complaints.csv', 'row': 0, 'Date received': '03/27/25', 'Product': 'Student loan', 'Sub-product': 'Federal student loan servicing', 'Issue': 'Dealing with your lender or servicer', 'Sub-issue': 'Trouble with how payments are being handled', 'Consumer complaint narrative': "The federal student loan COVID-19 forbearance program ended in XX/XX/XXXX. However, payments were not re-amortized on my federal student loans currently serviced by Nelnet until very recently. The new payment amount that is effective starting with the XX/XX/XXXX payment will nearly double my payment from {$180.00} per month to {$360.00} per month. I'm fortunate that my current financial position allows me to be able to handle the increased payment amount, but I am sure there are likely many borrowers who are not in the same position. The re-amortization should have occurred once the forbearance ended to reduce the impact to borrowers.", 'Company public response': 'None', 'Company'

## Task 3: Setting up QDrant!

Now that we have our documents, let's create a QDrant VectorStore with the collection name "LoanComplaints".

We'll leverage OpenAI's [`text-embedding-3-small`](https://openai.com/blog/new-embedding-models-and-api-updates) because it's a very powerful (and low-cost) embedding model.

> NOTE: We'll be creating additional vectorstores where necessary, but this pattern is still extremely useful.

In [4]:
from langchain_community.vectorstores import Qdrant
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

vectorstore = Qdrant.from_documents(
    loan_complaint_data,
    embeddings,
    location=":memory:",
    collection_name="LoanComplaints"
)

## Task 4: Naive RAG Chain

Since we're focusing on the "R" in RAG today - we'll create our Retriever first.

### R - Retrieval

This naive retriever will simply treat each complaint as a document, and use cosine similarity to fetch the 10 most relevant documents.

> NOTE: We're choosing `10` as our `k` here to provide enough documents for our reranking process later

In [8]:
naive_retriever = vectorstore.as_retriever(search_kwargs={"k" : 10})
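As a rough illustration of what "cosine similarity, top `k`" means, the sketch below ranks a few toy vectors against a query vector in plain Python. It is not how Qdrant implements search internally (a vector store uses indexes rather than a full scan), and `cosine_similarity` and `top_k` are hypothetical helper names used only here.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat zero vectors as unrelated to everything
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=10):
    """Return (doc_index, score) pairs for the k most similar vectors."""
    scored = [(cosine_similarity(query_vec, vec), i) for i, vec in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(i, score) for score, i in scored[:k]]


# Toy 2-dimensional "embeddings"; real embeddings have many more dimensions
toy_docs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
ranked = top_k([1.0, 0.0], toy_docs, k=2)
for index, score in ranked:
    print(f"doc {index}: {score:.4f}")
```

The retriever above does the same thing conceptually: embed the question, compare it with every stored complaint embedding, and return the `k` closest matches.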

### A - Augmented

We're going to go with a standard prompt for our simple RAG chain today! Nothing fancy here, we want this to mostly be about the Retrieval process.

In [5]:
from langchain_core.prompts import ChatPromptTemplate

RAG_TEMPLATE = """\
You are a helpful and kind assistant. Use the context provided below to answer the question.

If you do not know the answer, or are unsure, say you don't know.

Query:
{question}

Context:
{context}
"""

rag_prompt = ChatPromptTemplate.from_template(RAG_TEMPLATE)

### G - Generation

We're going to leverage `gpt-4.1-nano` as our LLM today, as - again - we want this to largely be about the Retrieval process.

In [6]:
from langchain_openai import ChatOpenAI

chat_model = ChatOpenAI(model="gpt-4.1-nano")

### LCEL RAG Chain

We're going to use LCEL to construct our chain.

> NOTE: This chain will be exactly the same across the various examples with the exception of our Retriever!

In [9]:
from langchain_core.runnables import RunnablePassthrough
from operator import itemgetter
from langchain_core.output_parsers import StrOutputParser

naive_retrieval_chain = (
    # INVOKE CHAIN WITH: {"question" : "<<SOME USER QUESTION>>"}
    # "question" : populated by getting the value of the "question" key
    # "context"  : populated by getting the value of the "question" key and piping it into naive_retriever

    {"context": itemgetter("question") | naive_retriever, "question": itemgetter("question")}

    # "context"  : is assigned to a RunnablePassthrough object (will not be called or considered in the next step)
    #              by getting the value of the "context" key from the previous step
    | RunnablePassthrough.assign(context=itemgetter("context"))

    # "response" : the "context" and "question" values are used to format our prompt object and then piped
    #              into the LLM and stored in a key called "response"
    # "context"  : populated by getting the value of the "context" key from the previous step
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)
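If the piped syntax is hard to follow, the data flow can be approximated with ordinary functions. This is a plain-Python analogy rather than LCEL itself: `fake_retriever`, `fake_llm`, `TEMPLATE`, and `run_chain` are hypothetical stand-ins for `naive_retriever`, `chat_model`, `rag_prompt`, and the chain, so no API calls are made.

```python
def fake_retriever(question):
    """Hypothetical stand-in for naive_retriever: returns canned 'documents'."""
    return [f"doc about: {question}"]


def fake_llm(prompt_text):
    """Hypothetical stand-in for chat_model: reports the prompt size."""
    return f"answer based on {len(prompt_text)} prompt characters"


# Simplified version of RAG_TEMPLATE
TEMPLATE = "Query:\n{question}\n\nContext:\n{context}\n"


def run_chain(inputs):
    # Step 1: build {"context", "question"} from the incoming question
    state = {
        "context": fake_retriever(inputs["question"]),
        "question": inputs["question"],
    }
    # Step 2: the assign step re-sets "context" to its own value, so the state is unchanged
    state = {**state, "context": state["context"]}
    # Step 3: format the prompt, call the model, and return the context alongside the response
    prompt_text = TEMPLATE.format(**state)
    return {"response": fake_llm(prompt_text), "context": state["context"]}


result = run_chain({"question": "What is the most common issue with loans?"})
print(sorted(result))
```

The final dictionary has the same two keys as the real chain's output, which is why later cells read `["response"]` and `["context"]` from the result.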

Let's see how this simple chain does on a few different prompts.

> NOTE: You might think that we've cherry picked prompts that showcase the individual skill of each of the retrieval strategies - you'd be correct!

In [18]:
naive_retrieval_chain_with_scores = (
    # Instead of using naive_retriever, use similarity_search_with_score directly
    {"context": lambda x: vectorstore.similarity_search_with_score(x["question"], k=10), 
     "question": itemgetter("question")}
    
    # Process the (doc, score) tuples and extract scores
    | RunnablePassthrough.assign(
        context=lambda x: [doc for doc, score in x["context"]],  # Extract just docs
        scores=lambda x: [score for doc, score in x["context"]]   # Extract just scores
    )
    
    # Generate response and keep everything
    | {"response": rag_prompt | chat_model, 
       "context": itemgetter("context"),
       "scores": itemgetter("scores")}
)

In [None]:
import numpy as np

# Print the response text and scores
def print_response_and_scores(result, title="Similarity Scores", print_context=False):
    print(result["response"].content)
    scores = result["scores"]
    print(f"\n{title}")
    print("-" * len(title))
    for i, score in enumerate(scores, 1):
        print(f"Doc {i:2d} score: {score:.4f}")

    print(f"\nSummary:")
    print(f"Average score: {np.mean(scores):.4f}")
    print(f"Best:    {max(scores):.4f}")
    print(f"Worst:   {min(scores):.4f}")
    if print_context:
        print("\nContext:")
        for doc in result["context"]:
            print(f"- {doc.page_content}")

In [None]:
result = naive_retrieval_chain_with_scores.invoke({"question": "Did any complaints not get handled in a timely manner?"})
print_response_and_scores(result, title="naive_retrieval_chain_with_scores", print_context=False)

In [None]:

print_response_and_scores(result, title="naive_retrieval_chain_with_scores", print_context=False)

Based on the provided context, yes, some complaints did not get handled in a timely manner. Specifically, at least one complaint (Complaint ID: 12709087 submitted to MOHELA) was marked as "No" under the "Timely response?" column, indicating it was not responded to in a timely way. The other complaints, such as those to Maximus Federal Services and EdFinancial Services, were marked as "Yes," suggesting they were handled within the expected timeframe.

naive_retrieval_chain_with_scores
---------------------------------
Doc  1 score: 0.5304
Doc  2 score: 0.4860
Doc  3 score: 0.4820
Doc  4 score: 0.4675
Doc  5 score: 0.4597
Doc  6 score: 0.4590
Doc  7 score: 0.4580
Doc  8 score: 0.4558
Doc  9 score: 0.4536
Doc 10 score: 0.4500

Summary:
Average score: 0.4702
Best:    0.5304
Worst:   0.4500


In [70]:

naive_retrieval_chain_with_scores.invoke({"question" : 'How many Complaints marked as "Timely response?" had a non-positive conotation in the Timeliy response field?'})["response"].content

'Based on the provided data, there are 5 complaints marked as "Timely response?" with a "Yes" in the field. None of these complaints have a non-positive connotation in the "Timely response" field; they all indicate a positive or satisfactory response status. Since the question specifically asks about complaints with a non-positive connotation in the "Timely response" field, and all noted responses are positive ("Yes"), the answer is:\n\nZero complaints with a "Timely response?" marked as "Yes" had a non-positive connotation in the "Timely response" field.'

In [62]:
result = naive_retrieval_chain_with_scores.invoke({"question": "What is the most common issue with loans?"})
print_response_and_scores(result, title="naive_retrieval_chain_with_scores", print_context=False)

The most common issues with loans, based on the complaints provided, include:

- Errors and discrepancies in loan balances and account information
- Problems with repayment and payment application, such as difficulty applying extra funds or payments being misapplied
- Issues related to loan transfer and lack of proper notification
- Unfair or confusing interest rate increases and loan terms
- Problems with loan reporting and credit report inaccuracies
- Challenges with loan forgiveness, cancellation, or discharge
- Mishandling of loan data and violations of privacy laws

Overall, issues around mismanagement, inaccuracies, and poor communication between lenders/servicers and borrowers appear to be most prevalent.

naive_retrieval_chain_with_scores
---------------------------------
Doc  1 score: 0.5133
Doc  2 score: 0.4949
Doc  3 score: 0.4940
Doc  4 score: 0.4930
Doc  5 score: 0.4896
Doc  6 score: 0.4889
Doc  7 score: 0.4833
Doc  8 score: 0.4796
Doc  9 score: 0.4787
Doc 10 score: 0.4748

In [16]:
naive_retrieval_chain.invoke({"question" : "What is the most common issue with loans?"})["response"].content

'The most common issue with loans, based on the complaints data, appears to be problems related to the handling and management of student loans, including errors in loan balances, misapplied payments, wrongful denials of payment plans, and issues stemming from loan transfers and information inaccuracies. Many complaints also involve the inability to properly apply payments, inaccurate reporting of account status, and issues with loan balances growing despite payments made.\n\nIn summary, the most common issues are:\n- Errors and inaccuracies in loan balances and reporting\n- Difficulties in managing payments and payment application\n- Problems arising from loan transfer or mismanaged accounts\n- Disputes over loan information and account status\n\nIf you are experiencing a specific issue, it is often related to mismanagement or inaccuracies in loan data and handling.'

In [17]:
naive_retrieval_chain.invoke({"question" : "Did any complaints not get handled in a timely manner?"})["response"].content

'Based on the provided information, yes, some complaints did not get handled in a timely manner. Specifically, the complaint with the ID 12709087 regarding issues with a graduated loan application submitted on 03/28/25 was marked as "Timely response?": No, indicating it was not handled promptly. Additionally, multiple complaints highlight delays and lack of responses over extended periods, such as over a year for certain account review requests and unresolved disputes.'

In [63]:
result = naive_retrieval_chain_with_scores.invoke({"question": "Did any complaints not get handled in a timely manner?"})
print_response_and_scores(result, title="naive_retrieval_chain_with_scores", print_context=False)

Based on the provided data, yes, there were complaints that did not get handled in a timely manner. Specifically, the complaint with Complaint ID 12709087, received on 03/28/25, was marked as "Timely response?": "No," indicating it was not handled in a timely manner. The complainant reported that despite multiple follow-ups, their issue remained unresolved and no response had been received for an extended period.

naive_retrieval_chain_with_scores
---------------------------------
Doc  1 score: 0.5304
Doc  2 score: 0.4860
Doc  3 score: 0.4820
Doc  4 score: 0.4675
Doc  5 score: 0.4597
Doc  6 score: 0.4590
Doc  7 score: 0.4580
Doc  8 score: 0.4558
Doc  9 score: 0.4536
Doc 10 score: 0.4500

Summary:
Average score: 0.4702
Best:    0.5304
Worst:   0.4500


In [None]:

result=naive_retrieval_chain_with_scores.invoke({"question" : 'How many Complaints marked as "Timely response?" had a negative  or undifined response in the Timely Response column?'})
print_response_and_scores(result, title="naive_retrieval_chain_with_scores", print_context=False)

Based on the provided data, there are 5 complaints that were marked as "Timely response?" with a response that is either negative, non-committal, or indicates no resolution (i.e., responses such as "Closed with explanation" or "Company has responded and chooses not to provide a public response"). These complaints show that even though the response was marked as "Yes" for timeliness, the actual content or context indicates unresolved issues or inadequate resolution.

To precisely answer the question: 

**Number of complaints marked as "Timely response?" with a negative or undefined response in the "Timely Response" column: 5.**

naive_retrieval_chain_with_scores
---------------------------------
Doc  1 score: 0.4394
Doc  2 score: 0.4341
Doc  3 score: 0.4299
Doc  4 score: 0.4273
Doc  5 score: 0.4198
Doc  6 score: 0.4188
Doc  7 score: 0.4152
Doc  8 score: 0.4152
Doc  9 score: 0.4127
Doc 10 score: 0.4116

Summary:
Average score: 0.4224
Best:    0.4394
Worst:   0.4116


In [18]:
naive_retrieval_chain.invoke({"question" : "Why did people fail to pay back their loans?"})["response"].content

'People failed to pay back their loans for several reasons, including:\n\n1. **Accumulation of interest and inability to afford payments:** Many borrowers found that lowering monthly payments led to continued interest accumulation, which increased the total debt and extended the payoff period, making repayment more difficult overall.\n\n2. **Financial hardships and stagnant wages:** Borrowers often faced financial hardships due to economic conditions, stagnant wages, or unexpected expenses, which made it impossible to keep up with payments without sacrificing basic necessities.\n\n3. **Lack of clear or adequate communication:** Some borrowers were not properly informed about their loan status, changes in servicers, or the resumption of payments after forbearance or deferment periods. This lack of information caused missed payments and credit issues.\n\n4. **Administrative errors and mismanagement by loan servicers:** Transfer of loan accounts without proper notification, incorrect repo

In [14]:
# Function to compare retrievers with scores
def compare_retrievers_with_scores(question, retrievers_dict):
    """
    Compare multiple retrievers and their similarity scores
    
    Args:
        question: The query to test
        retrievers_dict: Dictionary of {name: retriever_function} pairs
    
    Returns:
        Dictionary with results for each retriever
    """
    results = {}
    
    for name, retriever_func in retrievers_dict.items():
        try:
            # Get documents with scores
            docs = retriever_func(question)
            scores = [doc.metadata.get("score", 0.0) for doc in docs if hasattr(doc, 'metadata')]
            
            results[name] = {
                "scores": scores,
                "avg_score": sum(scores) / len(scores) if scores else 0,
                "max_score": max(scores) if scores else 0,
                "min_score": min(scores) if scores else 0,
                "num_docs": len(docs)
            }
        except Exception as e:
            results[name] = {"error": str(e)}
    
    return results

# Example usage for comparing different retrievers
# You can add more custom retrievers here as you implement them


In [22]:
# Example: Create a BM25 retriever with scores for comparison
# Note: BM25 retrievers don't have built-in similarity_search_with_score, 
# so we'll simulate scores based on rank position
"""
from langchain_core.runnables import chain

@chain
def bm25_retriever_with_scores(query: str) -> List[Document]:
    #BM25 retriever that adds rank-based scores to document metadata
    docs = bm25_retriever.invoke(query)
    
    # BM25 doesn't provide similarity scores like vector search,
    # so we'll use rank-based scoring (higher rank = lower score)
    for i, doc in enumerate(docs):
        # Rank-based score: starts at 1.0 and decreases by 0.1 for each rank
        rank_score = max(0.1, 1.0 - (i * 0.1))
        doc.metadata["score"] = rank_score
        doc.metadata["rank"] = i + 1
    
    return docs

# Test comparison between naive (vector) and BM25 retrievers
test_question = "Did any complaints not get handled in a timely manner?"

retriever_comparison = compare_retrievers_with_scores(
    test_question,
    {
        "Naive (Vector)": naive_retriever_with_scores,
        "BM25 (Rank-based)": bm25_retriever_with_scores
    }
)

def print_retriever_comparison_results(retriever_comparison):
    print("Retriever Comparison Results:")
    print("="*40)
    for name, metrics in retriever_comparison.items():
        if "error" not in metrics:
            print(f"\n{name}:")
            print(f"  Average Score: {metrics['avg_score']:.4f}")
            print(f"  Max Score: {metrics['max_score']:.4f}")
            print(f"  Min Score: {metrics['min_score']:.4f}")
            print(f"  Documents Retrieved: {metrics['num_docs']}")
        else:
            print(f"\n{name}: Error - {metrics['error']}")

"""

SyntaxError: invalid syntax (3331941400.py, line 9)

In [23]:
# General function to create retrieval chains with scores
def create_retrieval_chain_with_scores(retriever_func, chain_name=""):
    """
    Create a retrieval chain that includes similarity scores
    
    Args:
        retriever_func: A retriever function that returns documents with scores in metadata
        chain_name: Optional name for the chain (for debugging)
    
    Returns:
        A LangChain LCEL chain that returns response, context, and scores
    """

    def extract_scores_from_context(context):
        # Missing scores default to 0.0 so downstream statistics never fail
        return [doc.metadata.get("score", 0.0) for doc in context]

    chain = (
        {"context": itemgetter("question") | retriever_func, "question": itemgetter("question")}
        | RunnablePassthrough.assign(
            context=itemgetter("context"),
            scores=lambda x: extract_scores_from_context(x["context"])
        )
        | {
            "response": rag_prompt | chat_model,
            "context": itemgetter("context"),
            "scores": itemgetter("scores"),
            "question": itemgetter("question"),
            "chain_name": lambda x: chain_name
        }
    )

    return chain

# Create chains for different retrievers.
# Left commented out: bm25_retriever_with_scores only exists inside the disabled cell above.
# naive_chain_with_scores = create_retrieval_chain_with_scores(naive_retrieval_chain_with_scores, "Naive Vector")
# bm25_chain_with_scores = create_retrieval_chain_with_scores(bm25_retriever_with_scores, "BM25")

In [None]:
# Comprehensive chain comparison with scores
def compare_chains_with_scores(question, chains_dict):
    """
    Compare multiple retrieval chains and their performance
    
    Args:
        question: The query to test
        chains_dict: Dictionary of {name: chain} pairs
    
    Returns:
        Dictionary with results for each chain
    """
    results = {}
    
    print(f"Testing Question: '{question}'")
    print("="*60)
    
    for chain_name, chain in chains_dict.items():
        print(f"\n{chain_name} Results:")
        print("-" * 30)
        
        try:
            result = chain.invoke({"question": question})
            scores = result["scores"]
            
            print(f"Response: {result['response'].content[:200]}...")
            print(f"\nScores Summary:")
            print(f"  Average: {sum(scores)/len(scores):.4f}")
            print(f"  Max: {max(scores):.4f}")
            print(f"  Min: {min(scores):.4f}")
            print(f"  Std Dev: {(sum([(s - sum(scores)/len(scores))**2 for s in scores])/len(scores))**0.5:.4f}")
            
            # Store results for analysis
            results[chain_name] = {
                "response": result['response'].content,
                "scores": scores,
                "avg_score": sum(scores)/len(scores),
                "max_score": max(scores),
                "min_score": min(scores),
                "num_docs": len(scores)
            }
            
        except Exception as e:
            print(f"Error: {e}")
            results[chain_name] = {"error": str(e)}
    
    return results

"""
# Test both chains
test_question = "Why did people fail to pay back their loans?"

chain_comparison = compare_chains_with_scores(
    test_question,
    {
        "Naive Vector Chain": naive_chain_with_scores,
        "BM25 Chain": bm25_chain_with_scores
    }
)
"""

In [None]:
result = naive_retrieval_chain_with_scores.invoke({"question": "What is the most common issue with loans?"})
print_response_and_scores(result, title="naive_retrieval_chain_with_scores", print_context=False)

The most common issues with loans, based on the complaints provided, include:

- Errors and discrepancies in loan balances and account information
- Problems with repayment and payment application, such as difficulty applying extra funds or payments being misapplied
- Issues related to loan transfer and lack of proper notification
- Unfair or confusing interest rate increases and loan terms
- Problems with loan reporting and credit report inaccuracies
- Challenges with loan forgiveness, cancellation, or discharge
- Mishandling of loan data and violations of privacy laws

Overall, issues around mismanagement, inaccuracies, and poor communication between lenders/servicers and borrowers appear to be most prevalent.

naive_retrieval_chain_with_scores
---------------------------------
Doc  1 score: 0.5133
Doc  2 score: 0.4949
Doc  3 score: 0.4940
Doc  4 score: 0.4930
Doc  5 score: 0.4896
Doc  6 score: 0.4889
Doc  7 score: 0.4833
Doc  8 score: 0.4796
Doc  9 score: 0.4787
Doc 10 score: 0.4748

Overall, this is not bad! Let's see if we can make it better!

## Task 5: Best-Matching 25 (BM25) Retriever

Taking a step back in time - [BM25](https://www.nowpublishers.com/article/Details/INR-019) is based on [Bag-Of-Words](https://en.wikipedia.org/wiki/Bag-of-words_model) which is a sparse representation of text.

In essence, it's a way to compare how similar two pieces of text are based on the words they both contain.
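To give a feel for the scoring, here is a minimal sketch of a common BM25 formulation (the Lucene-style variant with a non-negative IDF). It is an assumption-laden illustration: the tokenizer is a simple lowercase split, `k1=1.5` and `b=0.75` are typical defaults rather than values taken from this notebook, and the exact numbers may differ from what `BM25Retriever` computes internally.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every document against the query with a basic BM25 formula."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs

    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            tf = term_freq[term]
            if tf == 0:
                continue  # terms absent from the document contribute nothing
            # Rare terms get a higher inverse document frequency
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            # Longer-than-average documents are penalized via b
            length_norm = 1 - b + b * len(tokens) / avg_len
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores.append(score)
    return scores


# Made-up toy documents
toy_docs = [
    "payment applied late by servicer",
    "servicer never replied to my dispute",
    "interest rate increased without notice",
]
scores = bm25_scores("late payment", toy_docs)
best = max(range(len(toy_docs)), key=scores.__getitem__)
print(toy_docs[best])
```

Only the first toy document shares words with the query, so it is the only one with a non-zero score. That is the key contrast with embeddings: BM25 rewards exact term overlap and knows nothing about synonyms.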

This retriever is very straightforward to set up! Let's see it happen down below!


In [24]:
from langchain_community.retrievers import BM25Retriever

bm25_retriever = BM25Retriever.from_documents(loan_complaint_data)

We'll construct the same chain - only changing the retriever.

In [25]:
bm25_retrieval_chain = (
    {"context": itemgetter("question") | bm25_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

Let's look at the responses!

In [21]:
bm25_retrieval_chain.invoke({"question" : "What is the most common issue with loans?"})["response"].content

'Based on the provided context, it appears that the most common issues with loans, particularly federal student loans, involve problems with loan servicing and miscommunication or misinformation from lenders or servicers. Specific issues include:\n\n- Dealing with lenders or servicers who do not provide clear or accurate information.\n- Problems with applying payments correctly, especially applying additional funds to principal or paying off loans faster.\n- Difficulty in accessing or understanding loan information, such as balances, interest, or repayment plans.\n- Disputes over fees, interest calculations, and the validity of the schools attended.\n\nOverall, a prevalent theme is frustration with loan servicers handling payments, providing misinformation, or failing to offer transparent and accurate assistance. \n\nSo, the most common issue seems to be related to **poor communication and handling of loan payments and information by the loan servicers**.'

In [22]:
bm25_retrieval_chain.invoke({"question" : "Did any complaints not get handled in a timely manner?"})["response"].content

'Based on the provided information, several complaints were responded to in a timely manner, as indicated by the "Timely response?" field being "Yes" for each complaint. There is no record of any complaints that were not handled in a timely manner.'

In [69]:

bm25_retrieval_chain.invoke({"question" : 'How many Complaints marked as "Timely response?" had a non-positive conotation in the Timeliy response field?'})["response"].content

'Based on the provided data, there are four complaints marked as "Timely response?" with a "Yes" in the field. Out of these four, three have consumer complaint narratives that express dissatisfaction, frustration, or negative sentiments, which can be interpreted as a non-positive connotation. These complaints are:\n\n1. Complaint with ID 13117781 (from row 480) - The narrative expresses frustration about loan forgiveness and the impact of COVID-19 on career prospects, indicating a negative or distressed connotation.\n2. Complaint with ID 12783455 (from row 508) - The narrative describes poor business practices, unhelpful responses, and feelings of frustration and lack of support, clearly non-positive.\n3. Complaint with ID 13001900 (from row 86) - The narrative discusses significant and devastating drops in credit scores, systemic issues, and feelings of being overwhelmed and betrayed, indicating a negative connotation.\n\nThe remaining complaint (ID 12933454, from row 61) has a neutra

In [23]:
bm25_retrieval_chain.invoke({"question" : "Why did people fail to pay back their loans?"})["response"].content

'People failed to pay back their loans for various reasons, including difficulties with payment plans, miscommunication or lack of communication from the loan servicers, and issues with the handling of their accounts. Some specific reasons seen in the complaints include being steered into wrong types of forbearances, servicers not responding or providing timely assistance, unnotified transfers of loans to new companies without proper contact or consent, billing errors leading to wrongful charges or overdue statuses, and technical issues such as payments being reversed or not processed correctly despite available funds. Additionally, some borrowers experienced negative impacts on their credit scores due to lack of transparency or failure to notify them about changes or overdue statuses.'

It's not clear whether this is better or worse. If only we had a way to test this! (Spoiler: we do, and the second half of the notebook covers it.)

#### ❓ Question #1:

Give an example query where BM25 is better than embeddings and justify your answer.

#### ✅ Answer #1:

In this case, the naive embeddings retriever surfaced more relevant information than BM25 for the timeliness question.  
Embeddings:  

>*some complaints did not get handled in a timely manner. Specifically, the complaint with the ID 12709087 regarding issues with a graduated loan application submitted on 03/28/25 was marked as "Timely response?": No, indicating it was not handled promptly. Additionally, multiple complaints highlight delays and lack of responses over extended periods, such as over a year for certain account review requests and unresolved disputes.'*

BM25 (incomplete information):

>*Based on the provided information, several complaints were responded to in a timely manner, as indicated by the "Timely response?" field being "Yes" for each complaint. There is no record of any complaints that were not handled in a timely manner.*

BM25 is a ranking algorithm from the Probabilistic Relevance Framework: it ranks documents by their relevance to the user's query. It scores lexical relevance rather than semantic similarity, using how often the query terms appear in a document (term frequency), the length of the document, and the inverse document frequency of each term.  

For example, running this query through the BM25 chain:

>*bm25_retrieval_chain.invoke({"question" : 'How many Complaints marked as "Timely response?" had a non-positive conotation in the Timeliy response field?'})["response"].content*

The response is:  

>*'Based on the provided data, there are four complaints marked as "Timely response?" with a "Yes" in the field. Out of these four, three have consumer complaint narratives that express dissatisfaction, frustration, or negative sentiments, which can be interpreted as a non-positive connotation. These complaints are:\n\n1. Complaint with ID 13117781 (from row 480) - The narrative expresses frustration about loan forgiveness and the impact of COVID-19 on career prospects, indicating a negative or distressed connotation.\n2. Complaint with ID 12783455 (from row 508) - The narrative describes poor business practices, unhelpful responses, and feelings of frustration and lack of support, clearly non-positive.\n3. Complaint with ID 13001900 (from row 86) - The narrative discusses significant and devastating drops in credit scores, systemic issues, and feelings of being overwhelmed and betrayed, indicating a negative connotation.\n\nThe remaining complaint (ID 12933454, from row 61) has a neutral or more procedural tone, focusing on requesting information without expressing overt dissatisfaction.\n\n**Therefore, the number of "Timely response?" complaints that had a non-positive connotation in their narratives is 3.**'*

Because the query contains exact keywords, the BM25 chain not only returned correct results but also gave the count.

By contrast, here is the same query run through the naive retrieval chain:
>*result=naive_retrieval_chain_with_scores.invoke({"question" : 'How many Complaints marked as "Timely response?" had a non-positive conotation in the Timeliy response field?'})["response"].content*

It gave this incorrect response:

>*'Based on the provided data, there are 5 complaints marked as "Timely response?" with a "Yes" in the field. None of these complaints have a non-positive connotation in the "Timely response" field; they all indicate a positive or satisfactory response status. Since the question specifically asks about complaints with a non-positive connotation in the "Timely response" field, and all noted responses are positive ("Yes"), the answer is:\n\nZero complaints with a "Timely response?" marked as "Yes" had a non-positive connotation in the "Timely response" field.'*

## Task 6: Contextual Compression (Using Reranking)

Contextual Compression is a fairly straightforward idea: We want to "compress" our retrieved context into just the most useful bits.

There are a few ways we can achieve this - but we're going to look at a specific example called reranking.

The basic idea here is this:

- We retrieve lots of documents that are very likely related to our query vector
- We "compress" those documents into a smaller set of *more* related documents using a reranking algorithm.
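The two steps above can be sketched without any external services. In this sketch both scoring functions are hypothetical stand-ins: the first stage ranks by shared-token count, and `toy_rerank_score` uses Jaccard overlap in place of a learned reranker. Only the retrieve-then-rerank shape is the point.

```python
def first_stage_retrieve(query, docs, k=4):
    """Cheap, recall-oriented stage: rank by number of shared tokens."""
    q_tokens = set(query.lower().split())
    ranked = sorted(
        docs,
        key=lambda d: len(q_tokens & set(d.lower().split())),
        reverse=True,
    )
    return ranked[:k]

def toy_rerank_score(query, doc):
    """Hypothetical stand-in for a reranker: Jaccard overlap of token sets."""
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / len(q | d)

def retrieve_and_compress(query, docs, k=4, top_n=2):
    """Retrieve k candidates, then keep only the top_n after reranking."""
    candidates = first_stage_retrieve(query, docs, k=k)
    reranked = sorted(candidates, key=lambda d: toy_rerank_score(query, d), reverse=True)
    return reranked[:top_n]

# Hypothetical toy documents for illustration.
rerank_docs = [
    "payment was misapplied to the wrong loan account and the servicer never fixed the payment after many calls",
    "misapplied payment",
    "credit report shows wrong balance",
    "servicer changed without notice",
    "interest rate dispute",
]
compressed = retrieve_and_compress("misapplied payment on loan", rerank_docs)
```

The first stage puts the long document first because it shares the most tokens with the query; the second stage promotes the short, focused document and drops the unrelated candidates.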

We'll be leveraging Cohere's Rerank model for our reranker today!

All we need to do is the following:

- Create a basic retriever
- Create a compressor (reranker, in this case)

That's it!

Let's see it in the code below!

In [26]:
from langchain.retrievers.contextual_compression import ContextualCompressionRetriever
from langchain_cohere import CohereRerank

compressor = CohereRerank(model="rerank-v3.5")
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor, base_retriever=naive_retriever
)

Let's create our chain again, and see how this does!

In [27]:
contextual_compression_retrieval_chain = (
    {"context": itemgetter("question") | compression_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

In [28]:
contextual_compression_retrieval_chain.invoke({"question" : "What is the most common issue with loans?"})["response"].content

'Based on the provided complaints, the most common issue with loans appears to be problems related to dealing with lenders or servicers, specifically including errors in loan balances, misapplied payments, wrongful denials of payment plans, incorrect or confusing information, and mishandling of loan data. Many complaints also involve lack of communication, incorrect information, unauthorized transfers, and privacy violations.'

In [27]:
contextual_compression_retrieval_chain.invoke({"question" : "Did any complaints not get handled in a timely manner?"})["response"].content

'Based on the provided data, there are indications that some complaints did not get handled in a timely manner. For example, one complaint regarding federal student loan servicing issues has been open since an unspecified date ("since XXXX") and still has not been resolved after nearly 18 months, despite the company response indicating the response was "timely." Additionally, multiple complaints mention waiting over a year for responses or resolutions, which suggests delays in handling these complaints. Therefore, yes, some complaints were not handled in a timely manner.'

In [28]:
contextual_compression_retrieval_chain.invoke({"question" : "Why did people fail to pay back their loans?"})["response"].content

'People failed to pay back their loans primarily due to a combination of factors such as a lack of proper information about their loans and repayment requirements, administrative issues, and financial hardship. Specifically, some borrowers were not aware that they needed to repay student loans or were never informed about the repayment process, resulting in unawareness or confusion. Administrative problems, such as transfers of loans without notification, difficulties accessing online accounts, or incorrect account information, further complicated repayment efforts. Additionally, borrowers faced financial challenges because the options available—like forbearance or deferment—often led to accruing interest, which increased the total debt over time and made it more difficult to pay off the loans. The accumulation of interest, combined with stagnant wages and unexpected financial burdens, contributed to many borrowers being unable to repay their loans fully.'

We'll need to rely on something like Ragas to help us get a better sense of how this is performing overall - but it "feels" better!

## Task 7: Multi-Query Retriever

Typically in RAG we have a single query - the one provided by the user.

What if we had... more than one query?

In essence, a Multi-Query Retriever works by:

1. Taking the original user query and creating `n` number of new user queries using an LLM.
2. Retrieving documents for each query.
3. Using all unique retrieved documents as context
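The three steps above can be sketched in plain Python. Here `toy_generate_queries` is a hypothetical stand-in for the LLM that rewrites the question, and `toy_retrieve` is a stand-in for a real retriever; the corpus is made up for illustration.

```python
def toy_generate_queries(question):
    """Hypothetical stand-in for the LLM that reformulates the user question."""
    return [
        question,
        question.replace("late", "delayed"),
        question.replace("late", "untimely"),
    ]

def toy_retrieve(query, corpus, k=2):
    """Stand-in retriever: rank documents by shared-token count, keep the top k."""
    q_tokens = set(query.lower().split())
    scored = [
        (len(q_tokens & set(text.lower().split())), doc_id)
        for doc_id, text in corpus.items()
    ]
    hits = [doc_id for overlap, doc_id in sorted(scored, key=lambda s: -s[0]) if overlap > 0]
    return hits[:k]

def multi_query_retrieve(question, corpus, k=2):
    """Retrieve for every reformulation and keep the unique documents in order."""
    unique_ids = []
    for query in toy_generate_queries(question):
        for doc_id in toy_retrieve(query, corpus, k=k):
            if doc_id not in unique_ids:
                unique_ids.append(doc_id)
    return unique_ids

# Hypothetical toy corpus.
toy_corpus = {
    "c1": "late response from servicer",
    "c2": "delayed response to my dispute",
    "c3": "untimely response after months",
    "c4": "interest rate increased",
}
single_query_ids = toy_retrieve("late response", toy_corpus)
multi_query_ids = multi_query_retrieve("late response", toy_corpus)
```

With a single query and `k=2`, document `c3` is never returned; the reformulated queries surface it, so the union covers more of the relevant documents.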

So, how hard is it to set up? Not bad! Let's see it down below!



In [29]:
from langchain.retrievers.multi_query import MultiQueryRetriever

multi_query_retriever = MultiQueryRetriever.from_llm(
    retriever=naive_retriever, llm=chat_model
)

In [30]:
multi_query_retrieval_chain = (
    {"context": itemgetter("question") | multi_query_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

In [31]:
multi_query_retrieval_chain.invoke({"question" : "What is the most common issue with loans?"})["response"].content

'The most common issue with loans, based on the complaints provided, appears to be problems with how student loan servicers handle payments, including errors in loan balances, misapplied payments, refusal or difficulty in applying extra payments to principal, and extensive issues with loan documentation and validation. Many complaints also highlight poor communication, incorrect or inconsistent loan information, unauthorized transfers, and disputes over interest calculations and loan totals.'

In [32]:
multi_query_retrieval_chain.invoke({"question" : "Did any complaints not get handled in a timely manner?"})["response"].content

'Yes, based on the provided complaints, some complaints were not handled in a timely manner. For example, one complaint (Complaint ID: 12709087) received by MOHELA was marked as "Timely response?": No, indicating it was not handled promptly. Additionally, multiple complaints mention delays of over a year, months, or weeks without resolution, confirming that certain issues did not get addressed in a timely manner.'

In [33]:
multi_query_retrieval_chain.invoke({"question" : "Why did people fail to pay back their loans?"})["response"].content

'People failed to pay back their loans primarily due to a combination of systemic issues, lack of clear information, and financial hardships. Specifically, many borrowers were not adequately informed about how forbearance and deferment options work, particularly that interest would continue to accrue and compound, making the total debt grow faster than expected. Some were steered into long-term forbearances or consolidations without being informed about alternative options like income-driven repayment or loan rehabilitation, which could have helped manage or reduce their debt. \n\nAdditionally, borrowers faced challenges like sudden transfers between loan servicers, incorrect reporting of account statuses, or being kept in forbearance without proper communication. Many have experienced unaffordable payment demands, increased loan balances due to interest capitalization, and a lack of support or guidance from loan servicers, all of which contribute to their inability to pay back their l

#### ❓ Question #2:

Explain how generating multiple reformulations of a user query can improve recall.


#### ✅ Answer #2:

$\text{Recall} = \Large{\frac{\text{True Positives}} {\text{True Positives} + \text{False Negatives}}}$

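Generating several reformulations gives the retriever more chances to match relevant documents that are phrased differently from the original question. Each reformulation retrieves its own set of documents, and the union of those sets is used as context, so documents that the original query missed (false negatives) can still be found, which raises recall. As we saw, once we added more queries, the chain built on the same naive retriever was able to identify:   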
>"based on the provided complaints, some complaints were not handled in a timely manner.For example, one complaint (Complaint ID: 12709087) received by MOHELA was marked as "Timely response?": No, indicating it was not handled promptly. Additionally, multiple complaints mention delays of over a year, months, or weeks without resolution, confirming that certain issues did not get addressed in a timely manner."  

Whereas before, the naive retriever with just one query gave:  
>"Based on the provided information, several complaints were responded to in a timely manner, as indicated by the "Timely response?" field being "Yes" for each complaint. There is no record of any complaints that were not handled in a timely manner"

That earlier answer reflects a *false negative*: a relevant complaint existed but was not retrieved, which lowers recall.
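As a small worked example of the recall formula, the sketch below compares a single-query result with a multi-query union. All document ids are hypothetical and chosen only to make the arithmetic visible.

```python
def recall(retrieved_ids, relevant_ids):
    """Recall = TP / (TP + FN), computed over document ids."""
    relevant = set(relevant_ids)
    retrieved = set(retrieved_ids)
    true_positives = len(relevant & retrieved)
    false_negatives = len(relevant - retrieved)
    return true_positives / (true_positives + false_negatives)

# Hypothetical ids: three complaints are truly relevant to the question.
relevant_ids = {"late_1", "late_2", "late_3"}
single_query_hits = ["late_1", "other_7"]
multi_query_hits = ["late_1", "other_7", "late_2", "late_3", "other_9"]

single_recall = recall(single_query_hits, relevant_ids)  # 1 of 3 relevant found
multi_recall = recall(multi_query_hits, relevant_ids)    # 3 of 3 relevant found
```

Note that the extra irrelevant documents in the multi-query result do not hurt recall; they would show up in precision instead.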

## Task 8: Parent Document Retriever

A "small-to-big" strategy - the Parent Document Retriever works based on a simple strategy:

1. Each un-split "document" will be designated as a "parent document" (You could use larger chunks of document as well, but our data format allows us to consider the overall document as the parent chunk)
2. Store those "parent documents" in a memory store (not a VectorStore)
3. We will chunk each of those documents into smaller documents, and associate them with their respective parents, and store those in a VectorStore. We'll call those "child chunks".
4. When we query our Retriever, we will do a similarity search comparing our query vector to the "child chunks".
5. Instead of returning the "child chunks", we'll return their associated "parent chunks".

Okay, maybe that was a few steps - but the basic idea is this:

- Search for small documents
- Return big documents

The intuition is that we're likely to find the most relevant information by limiting the amount of semantic information that is encoded in each embedding vector - but we're likely to miss relevant surrounding context if we only use that information.
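Before using the LangChain class, here is a minimal "search small, return big" sketch in plain Python. It makes several simplifying assumptions: children are fixed-size word windows rather than the output of a recursive splitter, `toy_similarity` (shared-token count) stands in for vector similarity, and the parent texts are made up.

```python
def split_into_children(text, words_per_chunk=5):
    """Split a parent document into small word-window child chunks."""
    words = text.split()
    return [
        " ".join(words[i:i + words_per_chunk])
        for i in range(0, len(words), words_per_chunk)
    ]

def build_indexes(parent_docs, words_per_chunk=5):
    """Return a docstore (parent_id -> text) and a child index of (child, parent_id)."""
    docstore = dict(parent_docs)
    child_index = []
    for parent_id, text in parent_docs.items():
        for child in split_into_children(text, words_per_chunk):
            child_index.append((child, parent_id))
    return docstore, child_index

def toy_similarity(query, text):
    """Hypothetical stand-in for vector similarity: shared-token count."""
    return len(set(query.lower().split()) & set(text.lower().split()))

def parent_document_retrieve(query, docstore, child_index, k=2):
    """Search the child chunks, then return their de-duplicated parents."""
    ranked = sorted(child_index, key=lambda pair: toy_similarity(query, pair[0]), reverse=True)
    parent_ids = []
    for _child, parent_id in ranked[:k]:
        if parent_id not in parent_ids:
            parent_ids.append(parent_id)
    return [docstore[pid] for pid in parent_ids]

# Hypothetical parent documents.
toy_parents = {
    "p1": "my payment was misapplied by the servicer and my balance is now wrong",
    "p2": "the school closed before graduation and nobody told me about loan discharge",
}
toy_docstore, toy_child_index = build_indexes(toy_parents)
parent_hits = parent_document_retrieve("misapplied payment", toy_docstore, toy_child_index)
```

Both of the top child chunks belong to `p1`, so the full parent is returned once rather than twice.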

Let's start by creating our "parent documents" and defining a `RecursiveCharacterTextSplitter`.

In [32]:
from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore
from langchain_text_splitters import RecursiveCharacterTextSplitter
from qdrant_client import QdrantClient, models

parent_docs = loan_complaint_data
child_splitter = RecursiveCharacterTextSplitter(chunk_size=750)

We'll need to set up a new QDrant vectorstore - and we'll use another useful pattern to do so!

> NOTE: We are manually defining our embedding dimension; you'll need to change this if you're using a different embedding model.

In [None]:
from langchain_qdrant import QdrantVectorStore

client = QdrantClient(location=":memory:")

client.create_collection(
    collection_name="full_documents",
    vectors_config=models.VectorParams(size=1536, distance=models.Distance.COSINE)
)

parent_document_vectorstore = QdrantVectorStore(
    collection_name="full_documents", 
    embedding=OpenAIEmbeddings(model="text-embedding-3-small"), 
    client=client
)

Now we can create our `InMemoryStore` that will hold our "parent documents" - and build our retriever!

In [34]:
store = InMemoryStore()

parent_document_retriever = ParentDocumentRetriever(
    vectorstore = parent_document_vectorstore,
    docstore=store,
    child_splitter=child_splitter,
)

By default, this is empty as we haven't added any documents - let's add some now!

In [35]:
parent_document_retriever.add_documents(parent_docs, ids=None)

We'll create the same chain we did before - but substitute our new `parent_document_retriever`.

In [36]:
parent_document_retrieval_chain = (
    {"context": itemgetter("question") | parent_document_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

Let's give it a whirl!

In [None]:
parent_document_retrieval_chain.invoke({"question" : "What is the most common issue with loans?"})["response"].content

'The most common issue with loans, based on the complaints provided, appears to be related to errors and misconduct in federal student loan servicing. Specific recurring problems include incorrect information on credit reports, misapplication of payments, wrongful denials of payment plans, discrepancies in loan balances and interest rates, and issues with collection and verification of debts.'

In [None]:
parent_document_retrieval_chain.invoke({"question" : "Did any complaints not get handled in a timely manner?"})["response"].content

'Based on the provided information, it appears that several complaints were not handled in a timely manner. Specifically, the complaints related to the student loan issues with MOHELA (Complaint IDs 12709087 and 12935889) indicate that the responses were "No" in the "Timely response?" field, meaning they were not handled promptly. Additionally, the complaint about the dispute settlement with Nelnet (Complaint ID 13205525) was responded to within the expected timeframe ("Yes" in "Timely response?"). \n\nTherefore, yes, some complaints—particularly those regarding MOHELA—did not get handled in a timely manner.'

In [None]:
parent_document_retrieval_chain.invoke({"question" : "Why did people fail to pay back their loans?"})["response"].content

'People failed to pay back their loans for various reasons, including:\n\n1. Lack of proper communication or notification from loan servicers about payment obligations, as indicated by complaints about not being notified when payments were due or about changes in loan ownership.\n2. Financial hardship or severe economic difficulties that made it impossible to make timely payments, such as unemployment or inability to find employment in their field.\n3. Misrepresentation or lack of transparency from educational institutions and loan providers regarding the long-term financial consequences, job prospects after graduation, and the sustainability of the school’s operations.\n4. Relying on deferment and forbearance options that increased interest and debt over time.\n5. Disputes over the legitimacy or ownership of the debt, including issues related to the legal verification of loans and deceptive practices by collection agencies.\n6. Personal health issues or other personal circumstances th

Overall, the performance *seems* largely the same. We can leverage a tool like Ragas to answer the performance question more rigorously.

## Task 9: Ensemble Retriever

In brief, an Ensemble Retriever takes two or more retrievers and combines their retrieved documents using a rank-fusion algorithm.

In this case - we're using the [Reciprocal Rank Fusion](https://plg.uwaterloo.ca/~gvcormac/cormacksigir09-rrf.pdf) algorithm.
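For intuition, here is a minimal sketch of weighted Reciprocal Rank Fusion. Each document earns `weight / (c + rank)` from every ranking it appears in, and the sums decide the final order. The constant `c=60` follows the value commonly used with RRF; the two input rankings are hypothetical, and this is a sketch of the idea rather than the library's exact code.

```python
from collections import defaultdict

def weighted_rrf(ranked_lists, weights, c=60):
    """Fuse several rankings of document ids into one, best first."""
    scores = defaultdict(float)
    for ranking, weight in zip(ranked_lists, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            # Higher-ranked documents contribute more; c dampens the top ranks.
            scores[doc_id] += weight / (c + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical rankings from two retrievers.
bm25_ranking = ["a", "b", "c"]
dense_ranking = ["b", "d", "a"]
fused_ranking = weighted_rrf([bm25_ranking, dense_ranking], weights=[0.5, 0.5])
```

Documents that appear in both rankings (`b` and `a`) rise to the top, even though neither retriever agreed on which one was first.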

Setting it up is as easy as providing a list of our desired retrievers - and the weights for each retriever.

In [37]:
from langchain.retrievers import EnsembleRetriever

retriever_list = [bm25_retriever, naive_retriever, parent_document_retriever, compression_retriever, multi_query_retriever]
equal_weighting = [1/len(retriever_list)] * len(retriever_list)

ensemble_retriever = EnsembleRetriever(
    retrievers=retriever_list, weights=equal_weighting
)

We'll pack *all* of these retrievers together in an ensemble.

In [38]:
ensemble_retrieval_chain = (
    {"context": itemgetter("question") | ensemble_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

Let's look at our results!

In [None]:
ensemble_retrieval_chain.invoke({"question" : "What is the most common issue with loans?"})["response"].content

'The most common issue with loans, based on the provided data, appears to be dealing with the loan servicer or lender, including errors in loan balances, misapplied payments, wrongful denials of payment plans, and problems with how payments are being handled. Several complaints highlight issues such as receiving bad information about loans, inability to properly apply payments to principal, inaccurate reporting of delinquency, and mishandling of loan transfers or consolidations. \n\nIn summary, a predominant and recurring problem is the mismanagement and poor communication from loan servicers, which leads to misapplied payments, incorrect account information, and difficulties in resolving repayment issues.'

In [None]:
ensemble_retrieval_chain.invoke({"question" : "Did any complaints not get handled in a timely manner?"})["response"].content

'Based on the provided complaints, yes, there are several instances indicating complaints not handled in a timely manner. For example:\n\n- One complaint (#12935889) about Mohela was marked as "Timely response?": No.\n- Another (#12744910) regarding inaccuracies in reporting and an ongoing dispute was "Timely response?": Yes, but the complaint was about inaccurate reporting and delays in correction, suggesting the issue persisted over time.\n- Multiple complaints (#12739706, #13062402, #13126709, #13127090, and others) mention delays, extended wait times, or responses that were not addressed promptly, with some even explicitly stating they did not receive responses within expected timeframes.\n- There are cases where the response was "Closed with explanation" but the delays or unresolved issues strongly imply they were not handled promptly or adequately.\n\nOverall, the evidence suggests that at least some complaints were not handled in a timely manner, as indicated directly by the res

In [None]:
ensemble_retrieval_chain.invoke({"question" : "Why did people fail to pay back their loans?"})["response"].content

'People failed to pay back their loans for several reasons, often related to mismanagement, misinformation, and systemic issues. Based on the provided complaints, common reasons include:\n\n1. **Lack of Notification and Communication:** Many borrowers were not properly notified about loan transfers, due dates, or repayment start dates, leading to unintentional delinquency and missed payments.\n\n2. **Misleading or Incomplete Information:** Borrowers reported receiving incorrect or misleading information about their loan balances, repayment obligations, or eligibility for programs like income-driven repayment or forgiveness, which caused confusion and unintended default.\n\n3. **System Errors and Technical Difficulties:** Issues such as online portal lockouts, incorrect account statuses, and errors in reporting contributed to borrowers not making payments or being marked delinquent improperly.\n\n4. **Inadequate Support and Assistance:** Borrowers often found customer service unhelpful,

## Task 10: Semantic Chunking

While this is not a retrieval method - it *is* an effective way of increasing retrieval performance on corpora that have clean semantic breaks in them.

Essentially, Semantic Chunking is implemented by:

1. Embedding all sentences in the corpus.
2. Combining or splitting sequences of sentences according to their semantic similarity, using one of several [possible thresholding methods](https://python.langchain.com/docs/how_to/semantic-chunker/):
  - `percentile`
  - `standard_deviation`
  - `interquartile`
  - `gradient`
3. Each sequence of related sentences is kept as a document!

Let's see how to implement this!

We'll use the `percentile` thresholding method for this example, which will:

Calculate all distances between adjacent sentences, and then break the sequence wherever a distance exceeds a given percentile of all distances.
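Here is a rough sketch of that percentile rule in plain Python. It is a simplification: `toy_embed` (bag-of-words counts) stands in for a real embedding model, sentences are passed in already split, and the sample sentences are invented.

```python
import math
from collections import Counter

def toy_embed(sentence):
    """Hypothetical stand-in for an embedding model: bag-of-words counts."""
    return Counter(sentence.lower().split())

def cosine_distance(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return 1.0 - dot / norm

def percentile(values, pct):
    """Linearly interpolated percentile of a non-empty list."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * pct / 100
    lo, hi = math.floor(pos), math.ceil(pos)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)

def semantic_chunks(sentences, pct=95):
    """Break the sentence sequence wherever the adjacent distance exceeds the percentile."""
    if len(sentences) < 2:
        return list(sentences)
    vectors = [toy_embed(s) for s in sentences]
    distances = [cosine_distance(vectors[i], vectors[i + 1]) for i in range(len(vectors) - 1)]
    threshold = percentile(distances, pct)
    chunks, current = [], [sentences[0]]
    for sentence, distance in zip(sentences[1:], distances):
        if distance > threshold:  # a large jump marks a topic shift
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
    chunks.append(" ".join(current))
    return chunks

# Hypothetical sentences with one clear topic shift in the middle.
toy_sentences = [
    "my loan payment was late",
    "the loan payment was late again",
    "the school closed last year",
    "the school closed without notice",
]
toy_chunks = semantic_chunks(toy_sentences)
```

Only the jump between the second and third sentences exceeds the threshold, so the four sentences end up as two chunks.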

In [39]:
from langchain_experimental.text_splitter import SemanticChunker

semantic_chunker = SemanticChunker(
    embeddings,
    breakpoint_threshold_type="percentile"
)

Now we can split our documents.

In [40]:
semantic_documents = semantic_chunker.split_documents(loan_complaint_data[:20])

Let's create a new vector store.

In [44]:
semantic_vectorstore = Qdrant.from_documents(
    semantic_documents,
    embeddings,
    location=":memory:",
    collection_name="Loan_Complaint_Data_Semantic_Chunks"
)

We'll use naive retrieval for this example.

In [45]:
semantic_retriever = semantic_vectorstore.as_retriever(search_kwargs={"k" : 10})

Finally we can create our classic chain!

In [46]:
semantic_retrieval_chain = (
    {"context": itemgetter("question") | semantic_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

And view the results!

In [None]:
semantic_retrieval_chain.invoke({"question" : "What is the most common issue with loans?"})["response"].content

'Based on the provided complaints, the most common issues with loans appear to be related to difficulties in communication and account management, such as:\n\n- Struggling to repay loans due to errors or issues with payment plans.\n- Problems with loan reporting, including incorrect or improper reporting of account status or default.\n- Difficulties in obtaining clear information about loan balances, loan servicer changes, or payment amounts.\n- Issues with loan servicing companies failing to respond appropriately or failing to verify or process applications.\n- Unauthorized or illegal reporting and collection practices, including violations of privacy laws.\n\nWhile these are specific to student loans in the context provided, a recurring theme is that many complaints involve mismanagement, lack of transparency, or errors in the handling of loans and related information. \n\nTherefore, a common underlying issue with loans, especially highlighted here, is **mismanagement or errors in se

In [None]:
semantic_retrieval_chain.invoke({"question" : "Did any complaints not get handled in a timely manner?"})["response"].content

'Based on the provided complaints, it appears that many complaints were responded to in a timely manner, with responses marked as \'Yes\' under the \'Timely response?\' field. Notably, several complaints state "Closed with explanation," indicating that they were addressed within the required time frame. \n\nHowever, there is at least one complaint regarding a lack of response or handling—specifically, the complaint about Nelnet (row 17). The consumer\'s narrative details multiple issues with lack of responses and conduct that suggests their complaint was not handled promptly or satisfactorily.\n\nIn summary:\n\n- Multiple complaints confirm responses were handled in a timely manner.\n- One complaint (about Nelnet\'s failure to respond to Certified Mail and ongoing misconduct) indicates that the complaint was not properly handled or responded to, suggesting that some complaints did not get handled in a timely manner.\n\nTherefore, yes, some complaints did not get handled in a timely man

In [None]:
semantic_retrieval_chain.invoke({"question" : "Why did people fail to pay back their loans?"})["response"].content

"People failed to pay back their loans for various reasons, including issues such as difficulties dealing with their loan servicers, miscommunications or inadequate information about their loan status, problems with payment processing, and disputes over the legitimacy or accuracy of their loan details. Some specific reasons noted in the complaints include receiving bad information about loan statuses, delays or errors in re-amortizing payments after forbearance ended, and inaccurate reports of default or delinquency. Additionally, instances of alleged mismanagement, lack of transparency, or improper handling of personal data have also contributed to borrowers' difficulties in repayment."

#### ❓ Question #3:

If sentences are short and highly repetitive (e.g., FAQs), how might semantic chunking behave, and how would you adjust the algorithm?

#### ✅ Answer #3:

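1. The example above does not pass `breakpoint_threshold_amount`, so the chunker uses its default percentile. With FAQ-like text, adjacent sentences are short and very similar, so the distances between them are small and tightly clustered, and a percentile that suits long narratives may split on noise or hardly split at all. The first adjustment is to tune the percentile (for example, trying values in the 50th-75th range) and check that the breakpoints line up with real topic shifts.
2. Try a hybrid approach that detects shifts at the topic level by comparing whole question-answer pairs instead of single sentences, breaking where:
   - `sim(QA_i, QA_i+1) < percentile(sim_all_QA_pairs, threshold)`
3. If the documents are FAQs, we could pre-process them so that each question-answer pair is treated as a single unit, e.g.:
   - `chunks = [f"Q: {q}\nA: {a}" for q, a in faq_pairs]`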
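Points 2 and 3 can be combined in a short sketch: build one unit per question-answer pair, then start a new group whenever the similarity between adjacent units drops. This is only an illustration; `jaccard` is a stand-in for embedding similarity, the fixed `min_similarity` replaces the percentile threshold for brevity, and the FAQ entries are hypothetical.

```python
def jaccard(a, b):
    """Stand-in similarity: overlap of whitespace tokens between two texts."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)

def group_faq_pairs(faq_pairs, min_similarity=0.3):
    """Keep each Q/A pair whole and group adjacent pairs on the same topic."""
    units = [f"Q: {q}\nA: {a}" for q, a in faq_pairs]
    groups = [[units[0]]]
    for previous, current in zip(units, units[1:]):
        if jaccard(previous, current) < min_similarity:
            groups.append([])  # similarity dropped: start a new topic group
        groups[-1].append(current)
    return groups

# Hypothetical FAQ entries.
toy_faq_pairs = [
    ("How do I change my payment date?", "Call your servicer to change the payment date."),
    ("How do I change my payment plan?", "Call your servicer to change the payment plan."),
    ("What is loan consolidation?", "Consolidation combines several loans into one."),
]
faq_groups = group_faq_pairs(toy_faq_pairs)
```

The two near-duplicate payment questions stay together, while the consolidation question starts its own group.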

# 🤝 Breakout Room Part #2

#### 🏗️ Activity #1

Your task is to evaluate the various Retriever methods against each other.

You are expected to:

1. Create a "golden dataset"
 - Use Synthetic Data Generation (powered by Ragas, or otherwise) to create this dataset
2. Evaluate each retriever with *retriever specific* Ragas metrics
 - Semantic Chunking is not considered a retriever method and will not be required for marks, but you may find it useful to do a "semantic chunking on" vs. "semantic chunking off" comparison between them
3. Compile these in a list and write a small paragraph about which is best for this particular data and why.

Your analysis should factor in:
  - Cost
  - Latency
  - Performance

> NOTE: This is **NOT** required to be completed in class. Please spend time in your breakout rooms creating a plan before moving on to writing code.

In [70]:
len(loan_complaint_data)

825

##### HINTS:

- LangSmith provides detailed information about latency and cost.
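If you also want a quick local number to sanity-check against the traces, a small timing helper is enough. This is a sketch under assumptions: `toy_retriever` is a hypothetical stand-in for whatever callable you want to time (for example, a function that wraps a retriever call), and the median is just one reasonable summary.

```python
import statistics
import time

def measure_latency(retrieve_fn, questions, repeats=3):
    """Return the median wall-clock seconds per call of `retrieve_fn`."""
    timings = []
    for _ in range(repeats):
        for question in questions:
            start = time.perf_counter()
            retrieve_fn(question)
            timings.append(time.perf_counter() - start)
    return statistics.median(timings)

def toy_retriever(question):
    """Hypothetical stand-in for a real retriever call."""
    return [word for word in question.split() if len(word) > 3]

median_seconds = measure_latency(
    toy_retriever,
    ["Why did people fail to pay back their loans?"],
)
```

Local timings like this capture only latency, not token cost, so they complement the tracing data rather than replace it.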

In [None]:
### YOUR CODE HERE
"""
Our data is: loan_complaint_data

"""

In [76]:
# LangSmith key and project to track latency and cost
from uuid import uuid4
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_PROJECT"] = f"Retrievers-eval-{uuid4().hex[0:8]}"

In [None]:
# Test that tracing is working
# OpenAI is emptying my bank account
"""
from langchain_openai import ChatOpenAI

# This should now be traced in LangSmith
test_llm = ChatOpenAI(model="gpt-4.1-nano")
test_result = test_llm.invoke("Hello, testing LangSmith tracing")
print("LangSmith tracing enabled!")
"""

In [53]:
import ragas
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from langchain_openai import ChatOpenAI
from langchain_openai import OpenAIEmbeddings
generator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4.1"))
generator_embeddings = LangchainEmbeddingsWrapper(OpenAIEmbeddings())

In [55]:
from ragas.testset import TestsetGenerator

generator = TestsetGenerator(llm=generator_llm, embedding_model=generator_embeddings)
dataset = generator.generate_with_langchain_docs(loan_complaint_data, testset_size=10)

Applying SummaryExtractor:   0%|          | 0/539 [00:00<?, ?it/s]

unable to apply transformation: Invalid json output: The borrower’s loan was in forbearance, but they unexpectedly received notice from Sloan Servicing that they were 90 days past due and owed over $9000, without being given a chance to cure the default. Their credit score dropped significantly after the servicer reported them to the credit bureau without prior contact. Attempts to request forbearance and an income-driven repayment plan were unsuccessful due to denial and website issues, so the borrower mailed the required forms with proof of income. The loan status online is unclear, and the borrower also experienced a breach of personal and financial data, violating FERPA. They are requesting full cancellation of their student debt.
For troubleshooting, visit: https://python.langchain.com/docs/troubleshooting/errors/OUTPUT_PARSING_FAILURE 


Applying CustomNodeFilter:   0%|          | 0/825 [00:00<?, ?it/s]

Node 80fb4b28-5494-4bd7-aff9-c92037e8d423 does not have a summary. Skipping filtering.
Node fd7b6455-e499-4049-877c-c1739e835ac8 does not have a summary. Skipping filtering.
Node bbd18c7b-b3eb-41be-89b8-9670ac2658dc does not have a summary. Skipping filtering.
Node cbd3a9aa-c4af-477d-88ee-88791117824a does not have a summary. Skipping filtering.
Node eafc86c8-351e-4a82-8159-d9da208ac12f does not have a summary. Skipping filtering.
Node 023fa053-ace3-4d99-9daf-e3807ac5b4d8 does not have a summary. Skipping filtering.
Node 8b078791-c28b-41b4-844b-0c2161e15047 does not have a summary. Skipping filtering.
Node a3573a58-1f61-4a09-a6dd-cdb8ee0db360 does not have a summary. Skipping filtering.
Node 619d2c79-90f0-4924-97a3-b00484e21c43 does not have a summary. Skipping filtering.
Node 57c65ed4-8b53-4777-8425-1658a29865b3 does not have a summary. Skipping filtering.
Node f5b7b882-02eb-4739-9334-1f356316ae1a does not have a summary. Skipping filtering.
Node f8d422f7-ccb7-499c-859b-8a18eb9a0f79 d

Applying [EmbeddingExtractor, ThemesExtractor, NERExtractor]:   0%|          | 0/2189 [00:00<?, ?it/s]

unable to apply transformation: node.property('summary') must be a string, found '<class 'NoneType'>'


Applying [CosineSimilarityBuilder, OverlapScoreBuilder]:   0%|          | 0/2 [00:00<?, ?it/s]

unable to apply transformation: Node fd664ec9-6885-41fa-9cb9-a010352e1213 has no summary_embedding


Generating personas:   0%|          | 0/3 [00:00<?, ?it/s]

Generating Scenarios:   0%|          | 0/2 [00:00<?, ?it/s]

Generating Samples:   0%|          | 0/10 [00:00<?, ?it/s]

In [56]:
dataset.to_pandas()

| | user_input | reference_contexts | reference | synthesizer_name |
|---|---|---|---|---|
| 0 | How did Nelnet handle the re-amortization of f... | [The federal student loan COVID-19 forbearance... | Payments on federal student loans serviced by ... | single_hop_specifc_query_synthesizer |
| 1 | Why they give me wrong payment on IDR/IBR plan... | [I submitted my annual Income-Driven Repayment... | Even though I submitted my annual IDR recertif... | single_hop_specifc_query_synthesizer |
| 2 | cancl my studnt loan debt? | [My personal and financial data was compromise... | I request full cancellation of my student loan... | single_hop_specifc_query_synthesizer |
| 3 | Why studentaid.gov say nelnet my issuer but ne... | [According to Studentaid.gov, Im to get an ema... | Studentaid.gov says that your issuer is nelnet... | single_hop_specifc_query_synthesizer |
| 4 | Why they say I in forbearance till 2040 but I ... | [Since the resumption of federal loan payments... | They told me I was in forbearance until 2040 b... | single_hop_specifc_query_synthesizer |
| 5 | How did EdFinancial Services mishandle my stud... | [<1-hop>\n\nTo Whom It May Concern, I am writi... | EdFinancial Services mishandled your student l... | multi_hop_specific_query_synthesizer |
| 6 | How have miscommunications and errors involvin... | [<1-hop>\n\nI am filing a complaint regarding ... | Miscommunications and errors involving Edfinan... | multi_hop_specific_query_synthesizer |
| 7 | How does the role of a federal student loan se... | [<1-hop>\n\nU.S. Department of Education\nFede... | A federal student loan servicer like Navient i... | multi_hop_specific_query_synthesizer |
| 8 | why edfinancial keep messin up my loan and mak... | [<1-hop>\n\nI contacted Edfinancial about my f... | edfinancial messed up by addin $12000.00 capit... | multi_hop_specific_query_synthesizer |
| 9 | How has the involvement of the loan servicer a... | [<1-hop>\n\nI am a veteran that is 100 % XXXX ... | The involvement of the loan servicer has signi... | multi_hop_specific_query_synthesizer |


In [58]:
type(dataset)

ragas.testset.synthesizers.testset_schema.Testset

In [4]:
import pandas as pd

pd.set_option("display.max_colwidth", None)     # Show entire column text
pd.set_option("display.max_columns", None)      # Show all columns
pd.set_option("display.max_rows", 100)          # Adjust as needed
pd.set_option("display.width", 0)               # Let it wrap naturally


In [5]:
from IPython.display import display, HTML

df = dataset.to_pandas()

# Render the full table inside a scrollable box so long cells stay readable
html = df.to_html(max_rows=None, max_cols=None)
scroll_html = f"""
<div style="height:400px; overflow:auto; border:1px solid #ccc">
{html}
</div>
"""
display(HTML(scroll_html))

NameError: name 'dataset' is not defined

In [68]:
# Convert df into a ragas EvaluationDataset
from ragas import EvaluationDataset

evaluation_dataset = EvaluationDataset.from_pandas(df)

In [69]:
# Select a judge model
from langchain_openai import ChatOpenAI
from ragas import evaluate
from ragas.llms import LangchainLLMWrapper

evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4.1-mini"))

In [78]:
# Collect all retrievers in a dictionary 
retrievers_dict = {
    "naive_retriever": naive_retriever,
    "bm25_retriever": bm25_retriever, 
    "compression_retriever": compression_retriever,
    "multi_query_retriever": multi_query_retriever,
    "parent_document_retriever": parent_document_retriever,
    "ensemble_retriever": ensemble_retriever,
    "semantic_retriever": semantic_retriever
}

print(f"Total retrievers to evaluate: {len(retrievers_dict)}")
for name in retrievers_dict.keys():
    print(f"- {name}")


Total retrievers to evaluate: 7
- naive_retriever
- bm25_retriever
- compression_retriever
- multi_query_retriever
- parent_document_retriever
- ensemble_retriever
- semantic_retriever


In [2]:
def add_all_retrievers_to_dataset(dataframe, retrievers_dict, k=10):
    """
    Add one retrieved-contexts column per retriever to a copy of the dataset.
    
    Returns:
        DataFrame with columns like:
        - user_input, reference, reference_contexts (original)
        - naive_retriever_contexts
        - bm25_retriever_contexts
        - compression_retriever_contexts
        - etc.
    """
    df_wretriever_context = dataframe.copy()
    
    for retriever_name, retriever in retrievers_dict.items():
        print(f"Adding {retriever_name} results...")
        
        retrieved_contexts_list = []
        for i, row in df_wretriever_context.iterrows():
            question = row['user_input']
            try:
                # Configure and run retriever
                if hasattr(retriever, 'search_kwargs'):
                    retriever.search_kwargs = {"k": k}
                elif hasattr(retriever, 'k'):
                    retriever.k = k
                    
                docs = retriever.invoke(question)
                retrieved_contexts = [doc.page_content for doc in docs]
                retrieved_contexts_list.append(retrieved_contexts)
            except Exception as e:
                print(f"Error with {retriever_name} on question {i}: {e}")
                retrieved_contexts_list.append([])
        
        # Add column for this retriever
        df_wretriever_context[f'{retriever_name}_contexts'] = retrieved_contexts_list
    
    return df_wretriever_context
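The function above picks between `search_kwargs` and `k` by duck typing. The sketch below isolates that step using stand-in classes, so it runs without a vector store or API key. `set_top_k` and both stub classes are hypothetical names introduced for illustration. Unlike the assignment above, this variant merges `k` into any existing `search_kwargs` instead of replacing the whole dictionary, and it reports when neither attribute exists.

```python
from dataclasses import dataclass, field


@dataclass
class StubVectorRetriever:
    """Stand-in for a retriever that reads its options from `search_kwargs`."""
    search_kwargs: dict = field(default_factory=dict)


@dataclass
class StubKeywordRetriever:
    """Stand-in for a retriever that exposes a plain `k` attribute."""
    k: int = 4


def set_top_k(retriever, k):
    """Set top-k on a retriever; return True if an attribute was updated."""
    if hasattr(retriever, "search_kwargs"):
        # Merge so that other options (filters, thresholds) are preserved
        retriever.search_kwargs = {**retriever.search_kwargs, "k": k}
        return True
    if hasattr(retriever, "k"):
        retriever.k = k
        return True
    # Neither attribute is present: the caller's k is not applied
    return False


vector = StubVectorRetriever(search_kwargs={"score_threshold": 0.5})
keyword = StubKeywordRetriever()

applied_vector = set_top_k(vector, 10)
applied_keyword = set_top_k(keyword, 10)
applied_other = set_top_k(object(), 10)

print(vector.search_kwargs, keyword.k, applied_other)
```

A `False` return is worth logging during evaluation: a retriever that exposes neither attribute would keep its own default result count, which can make comparisons across retrievers uneven.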

In [3]:
df_all_retrievers_context = add_all_retrievers_to_dataset(df, retrievers_dict, k=10)

NameError: name 'df' is not defined
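Once the wide table exists, each retriever still needs its own set of rows in which the contexts sit under a `retrieved_contexts` key, since that is the field ragas samples use. A minimal sketch of that reshaping, in plain Python on toy rows: `split_by_retriever` is a hypothetical helper, and the toy values are placeholders, not data from the notebook.

```python
def split_by_retriever(rows, retriever_names):
    """Turn wide rows (one `<name>_contexts` key per retriever) into
    one list of ragas-style records per retriever."""
    per_retriever = {}
    for name in retriever_names:
        column = f"{name}_contexts"
        per_retriever[name] = [
            {
                "user_input": row["user_input"],
                "reference": row["reference"],
                "reference_contexts": list(row["reference_contexts"]),
                "retrieved_contexts": list(row[column]),
            }
            for row in rows
        ]
    return per_retriever


# Toy rows standing in for df_all_retrievers_context.to_dict(orient="records")
rows = [
    {
        "user_input": "q1",
        "reference": "a1",
        "reference_contexts": ["c1"],
        "naive_retriever_contexts": ["c1", "c2"],
        "bm25_retriever_contexts": ["c3"],
    }
]

records = split_by_retriever(rows, ["naive_retriever", "bm25_retriever"])
print(records["naive_retriever"][0]["retrieved_contexts"])
```

Each list in `records` should then be usable with `EvaluationDataset.from_list(...)`, assuming the installed ragas version provides that constructor.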

In [None]:
def add_retriever_column_to_synth_df(retriever, retriever_name, dataframe, k=10):
    """
    Retrieve contexts for every question in the test dataset with a single retriever
    
    Args:
        retriever: The retriever to evaluate
        retriever_name: Name for identification
        dataframe: Pandas dataframe of the synthetic evaluation dataset
        k: Number of documents to retrieve
    
    Returns:
        EvaluationDataset with retrieved contexts populated
    """
    print(f"Evaluating {retriever_name}...")
    
    # For each question in dataset, get retrieved contexts
    retrieved_contexts_list = []
    df = dataframe.copy()

    for i, row in df.iterrows():
        question = row['user_input']
        
        try:
            # Configure retriever to return k documents
            if hasattr(retriever, 'search_kwargs'):
                retriever.search_kwargs = {"k": k}
            elif hasattr(retriever, 'k'):
                retriever.k = k
                
            # Get retrieved documents
            docs = retriever.invoke(question)
            
            # Extract just the text content
            retrieved_contexts = [doc.page_content for doc in docs]
            retrieved_contexts_list.append(retrieved_contexts)
            
        except Exception as e:
            print(f"Error retrieving for question {i}: {e}")
            retrieved_contexts_list.append([])  # Empty list if error
    
    # Store the contexts under `retrieved_contexts`, the column name ragas metrics read
    df['retrieved_contexts'] = retrieved_contexts_list
    
    # Convert back to EvaluationDataset
    retriever_dataset = EvaluationDataset.from_pandas(df)
    
    print(f"✓ Completed evaluation for {retriever_name}")
    return retriever_dataset


In [None]:
# Evaluate all retrievers
retriever_results = {}

for retriever_name, retriever in retrievers_dict.items():
    try:
        # Pass the DataFrame: the helper calls .copy() and .iterrows() on it
        evaluated_dataset = add_retriever_column_to_synth_df(
            retriever,
            retriever_name,
            df,
            k=10
        )
        retriever_results[retriever_name] = evaluated_dataset
        print(f"Successfully evaluated {retriever_name}\n")
        
    except Exception as e:
        print(f"Failed to evaluate {retriever_name}: {e}\n")
        continue

print(f"Successfully evaluated {len(retriever_results)} out of {len(retrievers_dict)} retrievers")


In [None]:
# Import ragas metrics for retrieval evaluation
from ragas.metrics import context_precision, context_recall

# Metrics to use for retrieval evaluation  
retrieval_metrics = [context_precision, context_recall]

print("Available metrics for retrieval evaluation:")
for metric in retrieval_metrics:
    print(f"- {metric.name}")
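To build intuition for what these two metrics reward, here is a simplified stand-in that needs no LLM. It assumes a retrieved chunk is "relevant" only when it exactly matches one of the reference contexts, which is much stricter than the LLM judgments ragas uses, so the numbers will not match ragas output. Both function names are hypothetical.

```python
def context_recall_exact(retrieved, reference_contexts):
    """Fraction of reference contexts that appear among the retrieved ones."""
    if not reference_contexts:
        return 0.0
    retrieved_set = set(retrieved)
    hits = sum(1 for ref in reference_contexts if ref in retrieved_set)
    return hits / len(reference_contexts)


def context_precision_exact(retrieved, reference_contexts):
    """Mean of precision@rank taken at each rank that holds a relevant chunk.

    Relevant chunks ranked near the top raise the score; relevant chunks
    buried under irrelevant ones lower it.
    """
    relevant = set(reference_contexts)
    hits = 0
    score = 0.0
    for rank, ctx in enumerate(retrieved, start=1):
        if ctx in relevant:
            hits += 1
            score += hits / rank  # precision at this rank
    return score / hits if hits else 0.0


retrieved = ["a", "x", "b", "y"]
reference_contexts = ["a", "b", "c"]

recall = context_recall_exact(retrieved, reference_contexts)
precision = context_precision_exact(retrieved, reference_contexts)
print(f"recall={recall:.3f} precision={precision:.3f}")
# → recall=0.667 precision=0.833
```

Here two of the three reference contexts were found (recall 2/3), and the relevant chunks sit at ranks 1 and 3, giving (1/1 + 2/3) / 2 = 5/6 for precision.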


In [None]:
from ragas.metrics import LLMContextRecall, context_precision, context_recall, ContextEntityRecall
from ragas import evaluate, RunConfig

custom_run_config = RunConfig(timeout=360)

# Note: these metrics score `retrieved_contexts`, so the dataset passed here must include that column
result = evaluate(
    dataset=evaluation_dataset,
    metrics=[LLMContextRecall(), ContextEntityRecall(), context_precision, context_recall],
    llm=evaluator_llm,
    run_config=custom_run_config
)
result

===========================================================================================================================

In [71]:
print(f"Dataset type: {type(dataset)}")
print(f"Dataset length: {len(dataset) if hasattr(dataset, '__len__') else 'No length'}")

Dataset type: <class 'ragas.testset.synthesizers.testset_schema.Testset'>
Dataset length: 10


In [72]:
if hasattr(dataset, 'to_pandas'):
    df = dataset.to_pandas()
    print(f"Columns: {list(df.columns)}")
    print(f"Number of rows: {len(df)}")

Columns: ['user_input', 'reference_contexts', 'reference', 'synthesizer_name']
Number of rows: 10


In [73]:
if len(dataset) > 0:
    print("First item structure:")
    print(f"Has user_input: {'user_input' in df.columns}")
    print(f"Has reference: {'reference' in df.columns}")  
    print(f"Has reference_contexts: {'reference_contexts' in df.columns}")

First item structure:
Has user_input: True
Has reference: True
Has reference_contexts: True
