# **Level 4: The Quest for Retrieval - Retrieving Relevant Knowledge for Your AI**

## Part 2: Keyword Search – The Precise Path


Hello everyone, and welcome back to **Level 4: The Quest for Retrieval**. In our last session, we got a bird's-eye view of the landscape, understanding that retrieval is all about finding the right pieces of knowledge from our archives to help our LLM generate the best possible answer.

We introduced the main paths on this quest: **Keyword**, **Sparse**, **Dense**, and **Hybrid** search. Each has its own strengths and is suited for a different kind of journey.

Today, we're taking our first step down a specific path. We're diving deep into **Keyword Search**, the oldest, simplest, and most direct form of retrieval. Don't let its simplicity fool you; it's a powerful tool that you'll use constantly, even within more advanced systems.

Think of keyword search like using a magnifying glass to find exact words or phrases in a book. Sometimes, that's *exactly* what you need!

-----

## What is Keyword Search? (The Basics of Exact Matching)

At its core, the concept is incredibly straightforward.

**Simple Definition:** Keyword search is a retrieval method that identifies documents or text segments based on the **literal presence** of specific words or phrases (keywords) from your query.

### How it Works (A Conceptual Look)

Under the hood, keyword search is all about **pattern matching**. The system breaks your query into terms and scans the text for an exact, character-for-character match of each one. If a term's sequence of characters exists in the document, that term matches; if it doesn't, it doesn't. It's a binary, yes-or-no operation.

This is fundamentally different from how humans understand language. When you hear the word "car," your brain instantly connects it to related concepts like "automobile," "vehicle," "driving," and "transportation." A basic keyword search system does not. For it, "car" is just the sequence of letters c-a-r. If a document contains "automobile" but not "car," a keyword search will completely miss it.

Let's look at a quick, illustrative example:

  * **Your Query:** "How do I reset my password?"
  * **Keywords:** "reset", "password"

Now, let's see how it would fare against a few documents in our knowledge base:

  * **Document 1:** "To **reset** your **password**, please follow the instructions on our website."
      * **Result:** **Match!** Both keywords are present exactly as typed.
  * **Document 2:** "For account security, you may need to change your passphrase. A passphrase is a sequence of words..."
      * **Result:** **No match.** Even though "change your passphrase" means nearly the same thing as "reset your password," neither keyword appears in the text.
  * **Document 3:** "If you have trouble with your **password**, this guide will help. The first step in any **reset** process is..."
      * **Result:** **Match!** Both words are present, even though they appear in a different order and are separated by other text.

This simple example reveals both the power and the primary weakness of this approach.
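
To make the matching rule concrete, here is a minimal sketch of the example above in plain Python. It assumes a very simple tokenizer (lowercase, split on non-alphanumeric characters) and an "all keywords must be present" rule; the function names `tokenize` and `matches_all_keywords` are made up for this illustration and are not part of any library.

```python
import re

def tokenize(text: str) -> set[str]:
    """Lowercase the text and return its alphanumeric word tokens as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def matches_all_keywords(keywords: list[str], document: str) -> bool:
    """Return True only if every keyword appears as a whole token in the document."""
    doc_tokens = tokenize(document)
    return all(keyword.lower() in doc_tokens for keyword in keywords)

keywords = ["reset", "password"]
documents = {
    "Document 1": "To reset your password, please follow the instructions on our website.",
    "Document 2": "For account security, you may need to change your passphrase. A passphrase is a sequence of words...",
    "Document 3": "If you have trouble with your password, this guide will help. The first step in any reset process is...",
}

# Word order and distance between the keywords do not matter, only presence.
results = {name: matches_all_keywords(keywords, text) for name, text in documents.items()}
for name, matched in results.items():
    print(f"{name}: {'Match' if matched else 'No match'}")
```

Documents 1 and 3 match and Document 2 does not, which mirrors the walkthrough: the check has no notion that "passphrase" and "password" are related.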

-----

## Strengths of Keyword Search (When Precision Matters)

Why would we use such a seemingly rigid method? Because sometimes, rigidity is exactly what we need.

  * **High Precision for Exact Matches:** This is its greatest strength. If you need to find a document that contains a specific, non-negotiable term, keyword search is your most reliable tool. Think of searching for unique identifiers like product codes (`XJ-48-B2`), error messages (`Error 0x80070005`), or legal clauses (`force majeure`). You don't want a "similar" error code; you want the *exact* one.
  * **Simplicity & Speed:** Conceptually, it's the easiest search method to understand. In practice, systems built for keyword search (like traditional search engines) are highly optimized and can be incredibly fast. They often use an "inverted index," which is like the index at the back of a book, mapping each word to every document it appears in.
  * **Interpretable Results:** The results are never a mystery. A document was returned because it contains the exact word(s) you searched for. There's no complex algorithm or "black box" deciding what's relevant. This transparency is invaluable for debugging and building trust in your RAG system.
  * **Cost-Effective:** Keyword search does not require generating expensive embeddings for your documents or queries. The indexing and search operations are computationally cheaper than the vector math involved in semantic search, which can lead to lower operational costs.
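
The inverted index mentioned above is easy to sketch. The following is a toy, in-memory version, assuming the same simple tokenizer idea (lowercase, split on non-alphanumeric characters); the corpus, document IDs, and function names are hypothetical, and production search engines add far more (ranking, compression, smarter tokenization of identifiers).

```python
import re
from collections import defaultdict

def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_inverted_index(corpus: dict[str, str]) -> dict[str, set[str]]:
    """Map each token to the set of document IDs that contain it."""
    index: dict[str, set[str]] = defaultdict(set)
    for doc_id, text in corpus.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index

def search(index: dict[str, set[str]], query: str) -> set[str]:
    """Return the IDs of documents that contain every token in the query."""
    tokens = tokenize(query)
    if not tokens:
        return set()
    postings = [index.get(token, set()) for token in tokens]
    return set.intersection(*postings)

# Hypothetical mini-corpus for illustration only.
corpus = {
    "doc_a": "Error 0x80070005: access is denied.",
    "doc_b": "Replacement part XJ-48-B2 ships next week.",
    "doc_c": "Access to part catalogs requires a login.",
}
index = build_inverted_index(corpus)

print(sorted(search(index, "access")))       # ['doc_a', 'doc_c']
print(sorted(search(index, "0x80070005")))   # ['doc_a']
print(sorted(search(index, "part access")))  # ['doc_c']
```

The speed advantage comes from the lookup step: a query touches only the entries for its own tokens instead of scanning every document.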

-----

## Limitations of Keyword Search (The Lexical Gap Problem)

While keyword search is precise, that precision comes at a great cost. Its primary weakness is a concept so important that it has its own name: the **Lexical Gap**.

### The "Lexical Gap" Explained

The lexical gap is the mismatch between the words a user types into a search box and the words that are actually used in the documents they are looking for. It's the reason our "reset password" vs. "change passphrase" example failed.

Think about all the ways you could phrase the same idea:

  * **Query:** "find cheap flights"
  * **Possible Document Phrasings:**
      * "affordable airfare"
      * "low-cost plane tickets"
      * "search for budget travel"

A pure keyword search for "cheap flights" would miss all of these perfectly relevant documents. This leads to several significant problems:

  * **No Semantic Understanding:** The system doesn't understand that "car" and "automobile" mean the same thing. It also can't differentiate between contexts. A search for "Apple" would return documents about the fruit and the technology company indiscriminately, because it only sees the letters a-p-p-l-e.
  * **Morphological Blindness:** It often struggles with different forms of the same word. A search for "run" might not find documents containing "running" or "ran".
  * **Low Recall:** Recall is a measure of how many of the *total* relevant documents were actually found. Because of the lexical gap, keyword search often misses relevant information, resulting in low recall. Each relevant document that the system fails to return is a "false negative": the information was there, but the system did not find it.

> #### **Key Takeaway: Precision vs. Recall**
>
>   * **Precision:** Of the documents you retrieved, how many were actually relevant? Keyword search is often **high-precision** (if it finds something with a specific term, it's likely relevant).
>   * **Recall:** Of all the relevant documents that exist in your database, how many did you find? Keyword search is often **low-recall** (it misses relevant documents due to wording differences).
>
> In RAG, we often battle between these two. We want to find *all* the relevant information (high recall) without including a lot of junk (high precision).
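
As a quick worked example of these two metrics, the sketch below computes them for one hypothetical search. The document IDs and the choice of which documents count as relevant are invented for illustration.

```python
def precision_recall(retrieved: set[str], relevant: set[str]) -> tuple[float, float]:
    """Compute precision and recall for a single query."""
    true_positives = len(retrieved & relevant)
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical scenario: four documents are truly relevant, but the keyword
# search only found the two that use the query's exact wording.
relevant = {"doc_1", "doc_2", "doc_3", "doc_4"}
retrieved = {"doc_1", "doc_2"}

precision, recall = precision_recall(retrieved, relevant)
print(f"Precision: {precision:.2f}, Recall: {recall:.2f}")  # Precision: 1.00, Recall: 0.50
```

Everything returned was relevant (precision 1.0), yet half of the relevant documents were missed (recall 0.5), which is the typical keyword search profile described above.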

-----

## When to Use Keyword Search in RAG Applications

Given its limitations, you might wonder if keyword search has a place in modern, sophisticated RAG systems. The answer is a resounding **yes**, but rarely alone.

1.  **Searching for Specific Identifiers:** This is the killer use case. When your query involves product SKUs, employee IDs, transaction numbers, or specific technical jargon that has no synonyms, keyword search is superior to semantic search.
2.  **When Exact Phrasing is Critical:** Think of legal research ("must be in writing"), compliance ("shall not be disclosed"), or finding a specific quote from a company all-hands meeting.
3.  **As a Pre-Filter:** You can use a keyword filter to dramatically narrow down the search space *before* applying a more computationally expensive semantic search. For example, if a user asks, "What are the Q3 sales goals for the 'Hydra' project?", you could first filter all documents to only those tagged with "Hydra" and "Q3," and *then* run a semantic search on that much smaller subset.
4.  **As Part of a Hybrid Search Strategy:** This is the most common and powerful application. You run a keyword search *and* a semantic search simultaneously and then combine the results. This gives you the best of both worlds: the precision of keyword matching and the broad understanding of semantics. We'll cover this in detail later in the course.

-----

## Implementing Keyword-Like Search in LangChain

This brings us to the practical part. How do we actually *do* this?

**An important note:** The core `VectorStoreRetriever` that you've been using is designed for **semantic search**. It takes your query, embeds it into a vector, and looks for other vectors that are close by in geometric space. LangChain's core abstractions don't include a standalone "KeywordSearchEngine" for literal matching. The lexical retrievers available through its community integrations (such as `BM25Retriever`) rank documents by term statistics, which belongs to the sparse search path rather than to plain exact matching.

However, we can achieve the *effect* of keyword search in two primary ways with the tools you already know.

### Approach 1: Leveraging Metadata Filters (Most Practical and Recommended)

This is the most common and scalable way to integrate keyword-style filtering into your RAG workflow. You already know that `Document` objects have a `page_content` field and a `metadata` dictionary. We can strategically use that `metadata` field to our advantage.

During your indexing phase, you can enrich your chunks with metadata tags that act as keywords.

Let's see this in action. Imagine we have internal company documents about different projects.

```python
# First, let's set up our environment
# Make sure you have langchain, langchain_openai, langchain_community, chromadb, and python-dotenv installed
# pip install langchain langchain_openai langchain_community chromadb python-dotenv

import os
from dotenv import load_dotenv

from langchain_core.documents import Document
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings

# Load environment variables from .env file
load_dotenv()

# Ensure you have your OPENAI_API_KEY set in your .env file
# OPENAI_API_KEY="sk-..."

# 1. Create Documents with Rich Metadata

# Notice how we're adding specific, filterable tags to the metadata
docs = [
    Document(
        page_content="Project Phoenix is focused on leveraging generative AI to improve customer support. The target launch is Q4 2025.",
        metadata={"project_code": "PHX-001", "team": "AI Research", "status": "active", "quarter": "Q4"},
    ),
    Document(
        page_content="The final budget for Project Phoenix has been approved at $1.5M. All expenditures must be tracked.",
        metadata={"project_code": "PHX-001", "team": "Finance", "status": "approved", "quarter": "Q3"},
    ),
    Document(
        page_content="Project Titan aims to overhaul our cloud infrastructure for better scalability. This is a high-priority initiative for Q4.",
        metadata={"project_code": "TTN-002", "team": "Infrastructure", "status": "active", "quarter": "Q4"},
    ),
    Document(
        page_content="A security audit for Project Titan is scheduled for next week. All team members must complete the pre-audit checklist.",
        metadata={"project_code": "TTN-002", "team": "Security", "status": "pending_audit", "quarter": "Q3"},
    ),
]

# 2. Initialize Embeddings and Vector Store
# We still need an embedding model because Chroma is a vector store at heart.
embedding_function = OpenAIEmbeddings()
vector_store = Chroma.from_documents(docs, embedding_function)

# 3. Perform a Search with a Keyword Filter

# Let's say we want to find all documents related to the "Phoenix" project.
# We can use the 'project_code' as our precise keyword.
# The query itself can be semantic, but the filter is a hard keyword match.

query = "What is the budget?"
phoenix_docs = vector_store.similarity_search(
    query,
    # This is the key part! The 'filter' argument performs a metadata search.
    filter={"project_code": "PHX-001"}
)

print("--- Search Results for Project Phoenix (PHX-001) ---")
for doc in phoenix_docs:
    print(f"Content: {doc.page_content}")
    print(f"Metadata: {doc.metadata}\n")

# Now let's find all active projects in Q4
query_q4 = "What's happening this quarter?"
q4_active_docs = vector_store.similarity_search(
    query_q4,
    # Chroma expects multiple conditions to be combined with an explicit $and operator.
    filter={"$and": [{"status": "active"}, {"quarter": "Q4"}]}
)

print("--- Search Results for Active Q4 Projects ---")
for doc in q4_active_docs:
    print(f"Content: {doc.page_content}")
    print(f"Metadata: {doc.metadata}\n")
```

What happened here? We performed a semantic search for "What is the budget?", but we told ChromaDB to *only* look at documents where the `metadata` dictionary contained `"project_code": "PHX-001"`. This isn't a pure keyword search across the `page_content`, but it's an incredibly effective and common pattern: **using keywords in metadata to filter and scope a semantic search.**

### Approach 2: Simple Pythonic Filtering (For Conceptual Understanding)

What if you wanted to replicate the *actual mechanism* of searching the `page_content` for a keyword? You could do this with a simple Python function, especially within a LangChain Expression Language (LCEL) chain.

**Disclaimer:** This method is for demonstration and learning. It is **not scalable**. Searching through a list of thousands of documents in memory like this would be very slow. Real-world systems use optimized indexing for this. But it's perfect for understanding the concept.

Let's build a simple chain that takes a list of documents and filters them with a basic Python `if "keyword" in text:` check.

```python
from langchain_core.runnables import RunnableLambda

# We will reuse the 'docs' list from the previous example.
# Our simple list of 4 documents will act as our entire database.
all_documents = docs

def simple_keyword_filter(inputs: dict) -> list[Document]:
    """
    A simple, non-scalable function to filter documents based on a keyword.
    'inputs' is expected to be a dictionary with 'documents' and 'keyword' keys.
    """
    keyword = inputs["keyword"].lower()
    input_docs = inputs["documents"]
    
    # This is the core logic: iterate and check for the substring.
    # We use .lower() to make the search case-insensitive, a common preprocessing step.
    filtered_docs = [
        doc for doc in input_docs if keyword in doc.page_content.lower()
    ]
    return filtered_docs

# We can wrap this function in a RunnableLambda to use it in a chain.
keyword_filter_runnable = RunnableLambda(simple_keyword_filter)

# Let's test it! We want to find documents that contain the word "security".
results = keyword_filter_runnable.invoke({
    "documents": all_documents,
    "keyword": "security"
})

print("--- Results from Simple Pythonic Keyword Filter for 'security' ---")
for doc in results:
    print(f"Content: {doc.page_content}")
    print(f"Metadata: {doc.metadata}\n")

# Let's try another one for "budget"
results_budget = keyword_filter_runnable.invoke({
    "documents": all_documents,
    "keyword": "budget"
})

print("--- Results from Simple Pythonic Keyword Filter for 'budget' ---")
for doc in results_budget:
    print(f"Content: {doc.page_content}")
    print(f"Metadata: {doc.metadata}\n")

```

This code directly illustrates the "pattern matching" concept we discussed. It's crude but effective for showing the mechanism. It also highlights why preprocessing is important—the `lower()` call prevented us from missing a match if the case was different.
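
One caveat of the `in` check is that it matches substrings, not words. The sketch below contrasts it with a whole-word match built on regular-expression word boundaries; the helper names and the sample sentence are hypothetical, and this is just one possible way to tighten the match.

```python
import re

def contains_substring(keyword: str, text: str) -> bool:
    """Case-insensitive substring check, like simple_keyword_filter uses."""
    return keyword.lower() in text.lower()

def contains_whole_word(keyword: str, text: str) -> bool:
    """Case-insensitive match that requires word boundaries around the keyword."""
    pattern = r"\b" + re.escape(keyword) + r"\b"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None

text = "Resources are scarce this quarter, so the budget review is delayed."

print(contains_substring("car", text))      # True: "car" hides inside "scarce"
print(contains_whole_word("car", text))     # False: no standalone word "car"
print(contains_whole_word("Budget", text))  # True
```

Substring matching can therefore hurt precision on short keywords, while whole-word matching is stricter and will not find "running" when you search for "run".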

-----

## Troubleshooting & Best Practices for Keyword Search

When you rely on keywords, you need to be mindful of their pitfalls.

  * **Lexical Gap Awareness:** This is rule #1. If you're getting poor results (low recall), the first thing to suspect is the lexical gap. Are there synonyms or alternative phrasings you're not accounting for?
  * **Preprocessing is Key:** As shown in our Pythonic example, simple normalization can make a huge difference.
      * **Lowercasing:** Convert all text to lowercase to prevent mismatches like "Apple" vs. "apple".
      * **Punctuation Removal:** Decide if punctuation should be removed. Does "run." match "run"?
  * **Stemming and Lemmatization (Brief Mention):** These are more advanced text preprocessing techniques.
      * **Stemming:** Chops words down to a common "stem" by stripping suffixes (e.g., "running", "runs" -> "run"). It's fast but can be crude, and it typically leaves irregular forms like "ran" untouched.
      * **Lemmatization:** Uses vocabulary and linguistic rules to reduce words to their dictionary form, their "lemma" (e.g., "ran" -> "run"; "was", "is" -> "be"; "better" -> "good"). It's more accurate but slower. You could integrate either technique into a `RunnableLambda` for more robust matching.
  * **Query Expansion (Brief Mention):** This is the practice of automatically adding synonyms to a user's query. If a user searches for "car," the system might expand the search to `("car" OR "automobile" OR "vehicle")`. This directly combats the lexical gap.
  * **Metadata Strategy:** If you plan to use metadata filtering (Approach 1), you need a good strategy. Plan your tags during the data ingestion phase. Keep them consistent. Is it `"team": "AI"` or `"team": "AI Research"`? Consistency is crucial for reliable filtering.
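
Here is a rough sketch of how normalization, stemming, and query expansion could fit together. Everything in it is an assumption made for illustration: `crude_stem` only strips a trailing plural "s" and is a stand-in for a real stemmer, and the `SYNONYMS` table is a tiny hand-written dictionary, not a real thesaurus.

```python
import re

# Hypothetical, hand-maintained synonym table (illustration only).
SYNONYMS = {"car": {"automobile", "vehicle"}}

def crude_stem(token: str) -> str:
    """Deliberately crude stand-in for a stemmer: strip a trailing plural 's'."""
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token

def preprocess(text: str) -> set[str]:
    """Lowercase, drop punctuation, tokenize, and stem."""
    return {crude_stem(token) for token in re.findall(r"[a-z0-9]+", text.lower())}

def expand_query(keyword: str) -> set[str]:
    """Return the stemmed keyword plus any stemmed synonyms."""
    term = crude_stem(keyword.lower())
    return {term} | {crude_stem(synonym) for synonym in SYNONYMS.get(term, set())}

def matches(keyword: str, text: str) -> bool:
    """True if the keyword or one of its synonyms appears in the text."""
    return bool(expand_query(keyword) & preprocess(text))

text = "All company vehicles must be inspected annually."

print("car" in preprocess(text))  # False: a plain lookup hits the lexical gap
print(matches("car", text))       # True: expansion adds "vehicle"
print(matches("Cars", text))      # True: lowercasing and stemming normalize the query
print(matches("truck", text))     # False: no synonym entry, no match
```

The quality of this approach depends entirely on the synonym table and the stemmer, which is why it narrows the lexical gap rather than closing it.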

-----

## Connecting to the Retrieval Workflow

Let's update our mental model of the RAG pipeline to see exactly where keyword-based filtering fits. It's not a replacement for the vector store, but a powerful companion to it.

```mermaid
graph TD
    subgraph "Indexing (The Archives)"
        A[Raw Data] --> B{Document Loaders};
        B --> C[Documents];
        C -- "Enrich with keyword tags!" --> D{Text Splitters};
        D --> E[Chunks with Metadata];
        E --> F{Embedding Model};
        F --> G[Vector Embeddings];
        G & E --> H[Vector Store];
    end

    subgraph "Retrieval & Generation (The Quest)"
        I[User Query] --> J{Query Pre-processing};
        J --> K["Keyword Search/Filtering <br/>(e.g., on Metadata in Vector Store)"];
        K --> L[Filtered Candidate Chunks];
        L -- "These can be the final context, <br/> or passed to another search step" --> M[Consolidated Context];
        M --> N[Prompt Template];
        N --> O[LLM];
        O --> P[Generated Answer];
    end
```

As the diagram shows, we can use keyword filtering right at the start of the retrieval process to quickly cull the search space. By filtering the documents in our `VectorStore` using metadata, we create a much smaller, more relevant set of candidates for the LLM to work with.

-----

## Key Takeaways

>   * **Core Idea:** Keyword search finds documents containing the **exact words** from a query. It's about literal string matching, not understanding meaning.
>   * **Main Weakness:** The **Lexical Gap**. It fails when the user's query uses different words (synonyms, etc.) than the document, leading to low recall (missed documents).
>   * **Main Strength:** High **precision** for specific identifiers (product codes, error messages) and exact phrases where wording is critical. It's also fast, simple, and interpretable.
>   * **Practical Implementation:** In LangChain, the most effective way to use keywords is by **filtering on metadata** within your `VectorStore` (`vector_store.similarity_search(query, filter={...})`).
>   * **Role in RAG:** It's rarely used alone for general Q&A. Its power lies in **pre-filtering** a dataset or being combined with semantic search in a **hybrid** system.

-----

## Exercises & Thought Experiments

1.  **Metadata Filtering Practice:** Take the ChromaDB code example from today's lecture.

      * Add a new document with the content "The marketing budget for Q3 is focused on social media campaigns" and metadata `{"project_code": "MKT-003", "team": "Marketing", "status": "active", "quarter": "Q3"}`.
      * Write a query that finds all documents from the "Finance" or "Marketing" teams. (Hint: Look up how ChromaDB handles `OR` conditions in filters, or perform two separate searches).
      * Perform a search for "budget" but *filter out* anything from Project Phoenix.

2.  **The Lexical Gap Challenge:**

      * Create a single `Document` object where the `page_content` is: "Our firm's automobiles are high-performance vehicles, and we ensure every ride is inspected for quality."
      * Using the `simple_keyword_filter` `RunnableLambda` from our lecture, try to find this document using the keyword "car".
      * Observe that it fails. In a text block, explain exactly why it failed, using the term "lexical gap." Then, list three different keywords that *would* have successfully found this document.

3.  **When is Keyword King?** Brainstorm and write down three specific, real-world RAG application scenarios where keyword search would likely be *more* useful and reliable than a pure semantic search. Explain your reasoning for each. (e.g., A system for looking up legal precedents, a chatbot for checking order statuses using an order ID, etc.).